Welcome to my page
This is the learning diary for the course ‘Introduction to Open Data Science 2020’. I would like to further develop my data science skills in R programming and get familiar with GitHub.
The project repository in GitHub: https://github.com/Imangholiloo/IODS-project
# Let's get the date and time
date()
## [1] "Fri Dec 11 00:23:47 2020"
Alternatively: print(Sys.time())
# load package
library(rmarkdown)
In this exercise we will use data from a survey taken by students of a statistics course.
Read the data to R
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
Alternatively, read it from your local drive: lrn14 <- read.table("C:/Users/Mohammad/Documents/IODS-project/data/JYTOPKYS3-data.txt", sep="\t", header=TRUE), or from the clipboard: lrn14 <- readClipboard(). The metadata (description of the data) is here (in Finnish): https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt
FYI: to keep characters as characters in R, add stringsAsFactors=FALSE to the read command as well.
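A minimal sketch of this, using the same survey data URL as above (note that since R 4.0.0, stringsAsFactors already defaults to FALSE):

```r
# keep character columns as characters when reading a file
lrn <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt",
                  sep = "\t", header = TRUE, stringsAsFactors = FALSE)
class(lrn$gender)  # "character" rather than "factor"
```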
# check dimensions of the data
dim(lrn14)
## [1] 183 60
# check structure of the data
#str(lrn14)
As shown, the data have 183 observations and 60 variables; all are integer, except gender, which is character.
Task 3 Selecting columns
Name the groups of columns (as lists) related to the questions under the umbrellas of ‘deep’, ‘surface’ and ‘strategic’ learning.
#install.packages('dplyr')
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30", "D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
# select the columns related to deep learning and create column 'deep' by averaging
deep_columns <- select(lrn14, one_of(deep_questions))
lrn14$deep <- rowMeans(deep_columns)
# select the columns related to surface_questions and create column 'surface' by averaging
surface_columns <- select(lrn14, one_of(surface_questions))
lrn14$surface <- rowMeans(surface_columns)
# select the columns related to strategic_questions and create column 'strategic' by averaging
strategic_columns <- select(lrn14, one_of(strategic_questions))
lrn14$strategic <- rowMeans(strategic_columns)
# select only the specific columns needed, including the ones just created
keep_columns <- c("gender","Age","Attitude", "deep", "strategic", "surface", "Points")
needed_columns_2014 <- select(lrn14, one_of(keep_columns))
# check structure of the new df
str(needed_columns_2014)
## 'data.frame': 183 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude : int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ strategic: num 3.38 2.75 3.62 3.12 3.62 ...
## $ surface : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
# Modify the column names: make them all lowercase (not capitalized)
colnames(needed_columns_2014)[2] <- "age"
colnames(needed_columns_2014)[3] <- "attitude"
colnames(needed_columns_2014)[7] <- "points"
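Renaming the columns one by one works, but all names can also be lowercased in a single call; a sketch of that alternative:

```r
# lowercase every column name at once (same end result as the lines above)
colnames(needed_columns_2014) <- tolower(colnames(needed_columns_2014))
```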
#check if worked fine
str(needed_columns_2014)
## 'data.frame': 183 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude : int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ strategic: num 3.38 2.75 3.62 3.12 3.62 ...
## $ surface : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
#exclude rows (observations) with zero exam points, i.e. keep points > 0
needed_columns_2014 <- filter(needed_columns_2014, points > 0) #alternatively: points != 0
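The same filtering can also be done in base R without dplyr; a sketch:

```r
# base-R equivalent of filter(needed_columns_2014, points > 0)
needed_columns_2014 <- needed_columns_2014[needed_columns_2014$points > 0, ]
```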
#let's check a descriptive statistical summary of the columns, FYI
summary(needed_columns_2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## strategic surface points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
View(needed_columns_2014) #to view the table in R
Task 4
#set working dir
setwd("C:/Users/Mohammad/Documents/IODS-project/data/")
#Save the analysis dataset to the ‘data’ folder
write.csv(needed_columns_2014, "learning2014.csv")
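Side note: write.csv() stores row names by default, which is why an extra “X” index column shows up when the file is read back in. A sketch of the alternative:

```r
# writing without row names avoids the extra "X" column on re-read
write.csv(needed_columns_2014, "learning2014.csv", row.names = FALSE)
```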
#to be sure, read the data back in again!
learning2014_read_again <- read.csv("C:/Users/Mohammad/Documents/IODS-project/data/learning2014.csv")
head(learning2014_read_again)
## X gender age attitude deep strategic surface points
## 1 1 F 53 37 3.583333 3.375 2.583333 25
## 2 2 M 55 31 2.916667 2.750 3.166667 12
## 3 3 F 49 25 3.500000 3.625 2.250000 24
## 4 4 M 53 35 3.500000 3.125 2.250000 10
## 5 5 M 49 37 3.666667 3.625 2.833333 22
## 6 6 F 38 38 4.750000 3.625 2.416667 21
#to see the structure of the dataset
str(learning2014_read_again)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude : int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ strategic: num 3.38 2.75 3.62 3.12 3.62 ...
## $ surface : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014_read_again)
## X gender age attitude
## Min. : 1.00 Length:166 Min. :17.00 Min. :14.00
## 1st Qu.: 42.25 Class :character 1st Qu.:21.00 1st Qu.:26.00
## Median : 83.50 Mode :character Median :22.00 Median :32.00
## Mean : 83.50 Mean :25.51 Mean :31.43
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00
## Max. :166.00 Max. :55.00 Max. :50.00
## deep strategic surface points
## Min. :1.583 Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.667 Median :3.188 Median :2.833 Median :23.00
## Mean :3.680 Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.917 Max. :5.000 Max. :4.333 Max. :33.00
read the data
learning2014 <- read.csv("C:/Users/Mohammad/Documents/IODS-project/data/learning2014_MI.csv")
#Metadata can be found in https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt"
#alternatively by:
#learning2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=TRUE, sep = ",")
Description of the data:
This dataset was gathered by Kimmo Vehkalahti in the course Introduction to Social Statistics, fall 2014, supported by his Teachers’ Academy funding (2013-2015).
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
View(learning2014)
dim(learning2014) #this dataset contains 166 rows and 8 columns
## [1] 166 8
colnames(learning2014) #column names are printed in the console
## [1] "X" "gender" "age" "attitude" "deep" "stra" "surf"
## [8] "points"
summary(learning2014) #a short descriptive statistical summary of the columns, e.g. mean, min, max, etc.
## X gender age attitude
## Min. : 1.00 Length:166 Min. :17.00 Min. :1.400
## 1st Qu.: 42.25 Class :character 1st Qu.:21.00 1st Qu.:2.600
## Median : 83.50 Mode :character Median :22.00 Median :3.200
## Mean : 83.50 Mean :25.51 Mean :3.143
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:3.700
## Max. :166.00 Max. :55.00 Max. :5.000
## deep stra surf points
## Min. :1.583 Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.667 Median :3.188 Median :2.833 Median :23.00
## Mean :3.680 Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.917 Max. :5.000 Max. :4.333 Max. :33.00
table(learning2014$gender) # show the number of participants by gender
##
## F M
## 110 56
table(learning2014$age)
##
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39 40 42 44 45
## 1 3 7 25 31 20 12 7 10 6 3 2 4 5 3 1 3 3 2 4 2 1 1 1 1 1
## 48 49 50 53 55
## 1 2 1 2 1
We can see the number of participants by age.
Task 2
#graphically show the interactions (i.e. correlations and distributions) between the columns (variables) of this dataset
plot(learning2014[c(-1,-2)]) #[c(-1,-2)] removes the index column and gender (character) from the df, source: https://statisticsglobe.com/remove-element-from-list-in-r
hist(learning2014$age, col='grey') # to check histogram of a column (e.g. age)
A loop to automatically plot a histogram of each variable. NB: check that all variables are numerical; otherwise the loop will stop at gender (male/female).
colnames(learning2014[-2]) #[-2] removes an element of the list, source: https://statisticsglobe.com/remove-element-from-list-in-r
## [1] "X" "age" "attitude" "deep" "stra" "surf" "points"
for (i in colnames(learning2014[-2])){
print(i)
hist(learning2014[[i]], col='grey')
}
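A defensive variant of the loop, which keeps only the numeric columns up front so that character columns such as gender are skipped automatically (a sketch):

```r
# loop over numeric columns only; character columns are filtered out first
numeric_cols <- names(learning2014)[sapply(learning2014, is.numeric)]
for (i in numeric_cols) {
  hist(learning2014[[i]], col = "grey", main = i, xlab = i)
}
```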
## [1] "X"
## [1] "age"
## [1] "attitude"
## [1] "deep"
## [1] "stra"
## [1] "surf"
## [1] "points"
Visualize with ggplot
#install and load it
#install.packages("ggplot2")
library(ggplot2)
# setup plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender))
# select visualization type (points)
p2 <- p1 + geom_point()
# draw the plot
p2
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# add a main title and draw the plot
p4 <- p3 + ggtitle("Student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'
Interpretation:
As the graph shows, there is a strong positive correlation between students’ attitude and their exam points. The correlation is stronger for male students, as their slope is steeper than the females’; so the attitude of male students influences the exam points more than that of female students.
Draw a scatter plot matrix
#pairs(learning2014[c(-1:-2)]) # c(-1:-2) removes those columns
pairs(learning2014[c(-1:-2)], col = c("black","red")) # alternatively, col = learning2014$gender
F: black, M: red
Now, use the GGally and ggplot2 libraries to create an even more advanced plot matrix with ggpairs().
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014[c(-1:-2)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p
I hope you got fascinated and liked it as much as I did :)
The graph shows a negative correlation (r ≈ -0.32) between the surface and deep variables in learning2014.
Task 3
Now that we are familiar with the variables and their distributions and correlations, it’s time to fit a regression model.
Let’s create a scatter plot of points versus attitude variables
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# let's make a simple linear regression model between exam points and attitude variables
my_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Interpretation: The model shows that attitude has a statistically significant relationship with the target variable and can be used to predict the exam points, because the Pr(>|t|) values are far below 0.05. Thus, the model is statistically significant at the 95% significance level; in fact it holds even at the >99% level, as the values are below 0.01. The fitted model is: points = 3.5255*attitude + 11.6372. A linear model is y = ax + b, where a is the coefficient of the variable x and b is the intercept; both can be derived from the model summary.
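The fitted equation can also be applied directly in R instead of by hand; a sketch using the coefficients estimated above (attitude = 4 is an arbitrary illustrative value):

```r
# predicted exam points for a student with attitude = 4
b <- coef(my_model)
b["(Intercept)"] + b["attitude"] * 4   # ~ 11.64 + 3.53 * 4 ≈ 25.7
# the same via predict()
predict(my_model, newdata = data.frame(attitude = 4))
```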
Let’s select the variables of interest and plot them:
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this model we tried to predict the exam points using the attitude, stra and surf variables. But as the results show, only attitude has a statistically significant relationship with the exam points; thus, we shall change the other variables.
Task 4
It seems that there is no significant relationship between the exam points and strategic or surface learning, so we shall remove them.
my_model3 <- lm(points ~ attitude + stra+ deep +age, data = learning2014)
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude + stra + deep + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9941 -3.0839 0.5037 3.5519 11.3298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.24374 3.57079 3.709 0.000286 ***
## attitude 3.53884 0.56538 6.259 3.37e-09 ***
## stra 1.05032 0.53652 1.958 0.051998 .
## deep -0.73226 0.74677 -0.981 0.328273
## age -0.08750 0.05303 -1.650 0.100866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.261 on 161 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2035
## F-statistic: 11.54 on 4 and 161 DF, p-value: 2.915e-08
As the model summary shows, the other variables (age, deep, stra) are still not significant, but stra is very close to the 95% significance level, as its Pr(>|t|) value is only slightly above 0.05.
Task 5
my_model4 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model4)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Interpretation
Same conclusion as above: stra is not statistically significant (at the 95% significance level), so we could either remove it from the equation (preferred) or keep it, for the purposes of this practical, to have a multi-variable regression.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2)) #defines that plots go to 2*2 frame (joining plots together)
plot(my_model4, which = c(1,2,5))
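Alongside the QQ-plot, the normality of the residuals can also be checked numerically, e.g. with a Shapiro-Wilk test (a sketch; a large p-value means no evidence against normality):

```r
# numeric complement to the QQ-plot: test residual normality
shapiro.test(residuals(my_model4))
```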
Interpretation
In linear regression, we assume:
1. a linear relationship between the explanatory variables and the target variable
2. errors are normally distributed
3. errors are not correlated
4. errors have constant variance
5. the size of a given error does not depend on the explanatory variables
We can check these as follows:
To explore the normality assumption, the QQ-plot helps. As it shows, there is a reasonable fit between the dots and the line; thus, the errors are approximately normally distributed.
To assess the constant-variance assumption (#4 above), we check the Residuals vs Fitted plot. It shows a reasonably random spread of the dots around the line, so this assumption also holds.
In other words, the Residuals vs Fitted plot confirms that the fitted model is appropriate, because the variability of the residuals does not increase with the fitted values.
The Residuals vs Leverage plot helps identify observations with an unusually high impact, such as outliers. There seem to be no such outliers in this data.
more info about interpretations: https://vimeo.com/199820384
Let’s also check the residuals of the model using a boxplot. The closer the median is to 0, the better the model; in other words, the plot shows the symmetry and specifically the normality of the error terms in the regression model.
boxplot(my_model4$residuals,
main = "Residuals of the model",
xlab = "Residual values",
ylab = "",
col = "orange",
border = "brown",
horizontal = F,
notch = F)
In this script we assess Student Performance Data Set from URL: https://archive.ics.uci.edu/ml/datasets/Student+Performance
Data description:
These data approach student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features; the data were collected using school reports and questionnaires. Two datasets are provided on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). For more info visit: https://archive.ics.uci.edu/ml/datasets/Student+Performance
# Read the data to R
# Set working directory
setwd("C:/Users/Mohammad/Documents/IODS-project/data")
# read math class questionnaire data into memory
student_mat <- read.csv("student-mat.csv", sep = ";", header = TRUE)
# Read portuguese class questionnaire data into memory
student_por <- read.csv("student-por.csv", sep = ";", header = TRUE)
# check dimensions of both data
dim(student_mat)
## [1] 395 33
dim(student_por)
## [1] 649 33
# check structure of both data
# str(student_mat) # commented out to shorten the report
# str(student_por)
# check the first rows of both data
# head(student_mat) # commented out to shorten the report
# head(student_por)
# check the last rows of both data
# tail(student_mat) # commented out to shorten the report
# tail(student_por)
#a short view about descriptive statistical features of the columns, e.g. mean, min, max, etc.
summary(student_mat)
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:395 Length:395 Length:395 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
summary(student_por)
## school sex age address
## Length:649 Length:649 Min. :15.00 Length:649
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.74
## 3rd Qu.:18.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:649 Length:649 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Mode :character Median :2.000 Median :2.000
## Mean :2.515 Mean :2.307
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:649
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.569 Mean :1.931 Mean :0.2219
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:649 Length:649 Length:649 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.931
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.18 Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
#column names are printed in console
colnames(student_mat)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
colnames(student_por)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
#to show the number of observations based on their school name
table(student_mat$school)
##
## GP MS
## 349 46
table(student_por$school)
##
## GP MS
## 423 226
Merge the two datasets using the dplyr library
# load library
library(dplyr)
# make a list of common columns to be used as identifiers in both datasets when merging
join_by_columns <- c("school","sex","age","address","famsize","Pstatus","Medu",
"Fedu","Mjob","Fjob","reason","nursery","internet")
# join the two datasets by inner join function of the dplyr library, using the selected identifiers
# Note: inner_join keeps only the rows (observations) that are present in both input datasets
students_math_por_joint <- inner_join(student_mat, student_por,
by = join_by_columns, suffix = c("_math", "_por"))
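To make the suffix behaviour of inner_join concrete, here is a toy illustration (the data are made up):

```r
# toy example: only ids present in both frames are kept, and the
# overlapping column 'score' gets the given suffixes
a <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
b <- data.frame(id = c(2, 3, 4), score = c(5, 6, 7))
inner_join(a, b, by = "id", suffix = c("_math", "_por"))
#   id score_math score_por
# 1  2         20         5
# 2  3         30         6
```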
#check the dimensions of the newly merged dataset
dim(students_math_por_joint) # 382 students
## [1] 382 53
# see the new column names
colnames(students_math_por_joint)
## [1] "school" "sex" "age" "address"
## [5] "famsize" "Pstatus" "Medu" "Fedu"
## [9] "Mjob" "Fjob" "reason" "guardian_math"
## [13] "traveltime_math" "studytime_math" "failures_math" "schoolsup_math"
## [17] "famsup_math" "paid_math" "activities_math" "nursery"
## [21] "higher_math" "internet" "romantic_math" "famrel_math"
## [25] "freetime_math" "goout_math" "Dalc_math" "Walc_math"
## [29] "health_math" "absences_math" "G1_math" "G2_math"
## [33] "G3_math" "guardian_por" "traveltime_por" "studytime_por"
## [37] "failures_por" "schoolsup_por" "famsup_por" "paid_por"
## [41] "activities_por" "higher_por" "romantic_por" "famrel_por"
## [45] "freetime_por" "goout_por" "Dalc_por" "Walc_por"
## [49] "health_por" "absences_por" "G1_por" "G2_por"
## [53] "G3_por"
# view the file in R
#View(students_math_por_joint)
# glimpse at the data; glimpse() is like str() but prints more data, like a transposed version of print()
# glimpse(students_math_por_joint) # commented out to shorten the report
# check data structure
# str(students_math_por_joint) # commented out to shorten the report
Combine the ‘duplicated’ answers in the joined data
# create a new data frame with only the joined columns
alc <- select(students_math_por_joint, one_of(join_by_columns))
# columns that were not used for joining the data
notjoined_columns <- colnames(student_mat)[!colnames(student_mat) %in% join_by_columns]
# print out the columns not used for joining
notjoined_columns
## [1] "guardian" "traveltime" "studytime" "failures" "schoolsup"
## [6] "famsup" "paid" "activities" "higher" "romantic"
## [11] "famrel" "freetime" "goout" "Dalc" "Walc"
## [16] "health" "absences" "G1" "G2" "G3"
# for every column name not used for joining...
for(column_name in notjoined_columns) {
# print the column name while looping to show progress
print(column_name)
# select two columns from 'math_por' with the same original name
two_columns <- select(students_math_por_joint, starts_with(column_name))
# select the first column vector of those two columns
first_column <- select(two_columns, 1)[[1]]
# if that first column vector is numeric...
if(is.numeric(first_column)) {
# take a rounded average of each row of the two columns and
# add the resulting vector to the alc data frame
alc[column_name] <- round(rowMeans(two_columns))
} else { # else if it's not numeric...
# add the first column vector to the alc data frame
alc[column_name] <- first_column
}
}
## [1] "guardian"
## [1] "traveltime"
## [1] "studytime"
## [1] "failures"
## [1] "schoolsup"
## [1] "famsup"
## [1] "paid"
## [1] "activities"
## [1] "higher"
## [1] "romantic"
## [1] "famrel"
## [1] "freetime"
## [1] "goout"
## [1] "Dalc"
## [1] "Walc"
## [1] "health"
## [1] "absences"
## [1] "G1"
## [1] "G2"
## [1] "G3"
# glimpse at the new combined data
# glimpse(alc) # commented out to shorten the report
Define a new column alc_use by combining weekday and weekend alcohol use (i.e. columns Dalc and Walc, respectively)
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
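A quick toy check of the averaging logic (illustrative values):

```r
# (Dalc + Walc) / 2 is computed row by row
toy <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))
mutate(toy, alc_use = (Dalc + Walc) / 2)
#   Dalc Walc alc_use
# 1    1    2     1.5
# 2    3    5     4.0
```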
# load library for plotting
library(ggplot2)
# Let's plot the alcohol use based on gender
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
# define the plot as a bar plot and draw it
g1 + geom_bar()
# create a new logical column named 'high_use': TRUE if consumption is > 2, FALSE otherwise
alc <- mutate(alc, high_use = alc_use > 2)
# initialize a plot of 'high_use'
g2 <- ggplot(alc, aes(high_use))
# draw a bar plot of high_use by sex
g2 + facet_wrap("sex") + geom_bar()
table(alc$high_use)
##
## FALSE TRUE
## 268 114
As the table shows, 114 students consumed alcohol above the threshold (2), so they count as high-use.
table(alc$high_use)/nrow(alc)*100
##
## FALSE TRUE
## 70.15707 29.84293
This means 29.84% of the students were high-use, while the majority (70.15%) were low-use.
table(alc$alc_use)/nrow(alc)*100
##
## 1 1.5 2 2.5 3 3.5 4
## 36.6492147 18.0628272 15.4450262 11.5183246 8.3769634 4.4502618 2.3560209
## 4.5 5
## 0.7853403 2.3560209
As this shows, the largest group (36.65%) had an alcohol use score of just 1.
# Let's make a histogram and check the distribution
hist(alc$alc_use)
# Let's make a boxplot to further see the distribution
boxplot(alc$alc_use)
# read libraries
library(tidyr); library(dplyr); library(ggplot2)
# glimpse at the alc data
glimpse(alc)
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9,...
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, ...
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15,...
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Rows: 13,370
## Columns: 2
## $ key <chr> "school", "school", "school", "school", "school", "school", "...
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "...
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 156 11.4
## 2 F TRUE 42 11.7
## 3 M FALSE 112 12.2
## 4 M TRUE 72 10.3
Would you like to create more plots, for example boxplots per group?
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")
"So, as plots show, those who consumed alcohol highly, their grade reduced especially for males"
## [1] "So, as plots show, those who consumed alcohol highly, their grade reduced especially for males"
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
"As plots show, those who consumed alcohol highly, their absence were higher especially in males"
## [1] "As plots show, those who consumed alcohol highly, their absence were higher especially in males"
dim(alc)
## [1] 382 35
# Save files to your computer directory
write.csv(alc, "alc_joinet.csv")
write.csv(alc, "students_math_por_joint.csv")
Now that we have prepared the data, let’s move on to analysing it.
# read the data from your computer drive
data_student_perf_alc = read.csv("alc_joinet.csv")
#Alternatively you can read from the URL:
data_student_perf_alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=TRUE)
Data description:
This dataset describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, collected via school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). We joined the two datasets and are now ready for the analysis phase.
Thus, we will use this dataset to analyse the relationships between high/low alcohol consumption and other variables. Is there any relationship between students’ performance and alcohol? Logistic regression will be used. ___For more info visit: https://archive.ics.uci.edu/ml/datasets/Student+Performance___
# Variable names are as follows
colnames(data_student_perf_alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
I would like to select 4 interesting variables to start modelling. My hypotheses are as follows:
1. internet: students with Internet access at home will consume less alcohol
2. goout: students who go out with friends more will consume more alcohol
3. absences: students who are often absent from class consume more alcohol
4. activities: students with more extra-curricular activities will consume less alcohol
# select only those columns
variables <- c("internet", "goout", "absences", "activities",
"alc_use", "high_use")
dt_some_col <- select(data_student_perf_alc, variables)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(variables)` instead of `variables` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
dim(dt_some_col)
## [1] 382 6
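Since dplyr 1.0, passing an external character vector straight into select() triggers the note above; wrapping the vector in all_of() makes the selection explicit and silences it. A minimal sketch on a toy data frame (not the real dataset):

```r
library(dplyr)

# toy data frame standing in for the real dataset
df <- data.frame(internet = "yes", goout = 3, absences = 0, extra = 1)
variables <- c("internet", "goout", "absences")

# all_of() errors if any name is missing, which is usually what we want
dt <- select(df, all_of(variables))
dim(dt)   # 1 3
```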
Draw a bar plot of each variable
gather(dt_some_col) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# see the proportional table
table(dt_some_col$absences)/nrow(dt_some_col)*100
##
## 0 1 2 3 4 5 6
## 30.3664921 0.7853403 17.5392670 1.5706806 13.3507853 1.0471204 7.8534031
## 7 8 9 10 11 12 13
## 2.0942408 5.4973822 0.7853403 4.4502618 0.5235602 2.6178010 0.7853403
## 14 15 16 17 18 19 20
## 2.8795812 0.7853403 1.5706806 0.2617801 1.0471204 0.2617801 0.5235602
## 21 22 23 24 25 26 28
## 0.2617801 0.7853403 0.2617801 0.2617801 0.2617801 0.2617801 0.2617801
## 30 54 56 75
## 0.2617801 0.2617801 0.2617801 0.2617801
#data structure
str(dt_some_col)
## 'data.frame': 382 obs. of 6 variables:
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
#Crosstable
cross_table = xtabs(~alc_use + internet, data = dt_some_col)
cross_table
## internet
## alc_use no yes
## 1 27 117
## 1.5 7 61
## 2 9 49
## 2.5 5 37
## 3 6 26
## 3.5 2 15
## 4 2 7
## 4.5 0 3
## 5 0 9
round(prop.table(cross_table), 2)
## internet
## alc_use no yes
## 1 0.07 0.31
## 1.5 0.02 0.16
## 2 0.02 0.13
## 2.5 0.01 0.10
## 3 0.02 0.07
## 3.5 0.01 0.04
## 4 0.01 0.02
## 4.5 0.00 0.01
## 5 0.00 0.02
# test whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cross_table
## X-squared = 6.0051, df = 8, p-value = 0.6467
As the p-value > 0.05, the distribution of this variable (internet access at home) does not differ significantly.
Note: the warning arises because some expected cell counts are small, so the chi-squared approximation may be inaccurate.
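The warning can also be addressed directly: chisq.test() accepts simulate.p.value = TRUE to compute a Monte Carlo p-value that does not rely on the large-sample approximation. A sketch using the first three alc_use rows of the internet cross-table above:

```r
# first three alc_use levels of the internet cross-table above
tab <- matrix(c(27, 117,
                 7,  61,
                 9,  49), nrow = 3, byrow = TRUE)
set.seed(1)  # Monte Carlo p-values are random, so fix the seed
chisq.test(tab, simulate.p.value = TRUE, B = 2000)
```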
Now for another variable.
#Crosstable
cross_table = xtabs(~alc_use + goout, data = dt_some_col)
cross_table
## goout
## alc_use 1 2 3 4 5
## 1 16 49 45 25 9
## 1.5 3 22 29 8 6
## 2 2 13 26 10 7
## 2.5 2 7 13 16 4
## 3 0 4 3 14 11
## 3.5 1 2 4 3 7
## 4 0 0 1 5 3
## 4.5 0 1 0 1 1
## 5 0 1 2 0 6
# to get proportion of each, with 2 decimal rounded
round(prop.table(cross_table), 2)
## goout
## alc_use 1 2 3 4 5
## 1 0.04 0.13 0.12 0.07 0.02
## 1.5 0.01 0.06 0.08 0.02 0.02
## 2 0.01 0.03 0.07 0.03 0.02
## 2.5 0.01 0.02 0.03 0.04 0.01
## 3 0.00 0.01 0.01 0.04 0.03
## 3.5 0.00 0.01 0.01 0.01 0.02
## 4 0.00 0.00 0.00 0.01 0.01
## 4.5 0.00 0.00 0.00 0.00 0.00
## 5 0.00 0.00 0.01 0.00 0.02
# test whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cross_table
## X-squared = 108.14, df = 32, p-value = 3.414e-10
As the p-value < 0.05, the distribution of this variable (going-out behaviour) differs significantly.
Note: the warning arises because some expected cell counts are small.
Now for yet another variable.
#Crosstable
cross_table = xtabs(~alc_use + absences, data = dt_some_col)
#cross_table
# to get proportion of each, with 2 decimal rounded
round(prop.table(cross_table), 2)
## absences
## alc_use 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 1 0.15 0.01 0.07 0.00 0.04 0.00 0.03 0.01 0.02 0.01 0.01 0.00 0.01 0.00
## 1.5 0.06 0.00 0.04 0.00 0.03 0.00 0.01 0.00 0.01 0.00 0.01 0.00 0.01 0.00
## 2 0.04 0.00 0.03 0.01 0.02 0.01 0.01 0.01 0.01 0.00 0.01 0.00 0.00 0.00
## 2.5 0.02 0.00 0.01 0.01 0.01 0.00 0.01 0.01 0.00 0.00 0.01 0.00 0.01 0.00
## 3 0.02 0.00 0.02 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
## 3.5 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
## 4 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 4.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 5 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## absences
## alc_use 14 15 16 17 18 19 20 21 22 23 24 25 26 28
## 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 1.5 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 2 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 2.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
## 3 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 3.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 4.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 5 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## absences
## alc_use 30 54 56 75
## 1 0.00 0.00 0.00 0.00
## 1.5 0.00 0.00 0.00 0.00
## 2 0.00 0.00 0.00 0.00
## 2.5 0.00 0.00 0.00 0.00
## 3 0.00 0.00 0.00 0.00
## 3.5 0.00 0.00 0.00 0.00
## 4 0.00 0.00 0.00 0.00
## 4.5 0.00 0.00 0.00 0.00
## 5 0.00 0.00 0.00 0.00
# test whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cross_table
## X-squared = 348.17, df = 248, p-value = 2.751e-05
As p-value < 0.05, thus the distribution of this variable (absence) is significantly different.
Note: the warning arises because some expected cell counts are small.
And one more variable.
#Crosstable
cross_table = xtabs(~alc_use + activities, data = dt_some_col)
cross_table
## activities
## alc_use no yes
## 1 58 86
## 1.5 40 28
## 2 24 34
## 2.5 22 20
## 3 18 14
## 3.5 10 7
## 4 4 5
## 4.5 1 2
## 5 4 5
# to get proportion of each, with 2 decimal rounded
round(prop.table(cross_table), 2)
## activities
## alc_use no yes
## 1 0.15 0.23
## 1.5 0.10 0.07
## 2 0.06 0.09
## 2.5 0.06 0.05
## 3 0.05 0.04
## 3.5 0.03 0.02
## 4 0.01 0.01
## 4.5 0.00 0.01
## 5 0.01 0.01
# test whether the distribution differs significantly
chisq.test(cross_table)
## Warning in chisq.test(cross_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cross_table
## X-squared = 9.9466, df = 8, p-value = 0.2688
As p-value > 0.05, thus the distribution of this variable (extra-curricular activities) is not significantly different.
Note: the warning arises because some expected cell counts are small.
Let’s draw bar plots of the variables of interest
barplot(sort(table(dt_some_col$internet),decreasing=T))
barplot(sort(table(dt_some_col$goout),decreasing=T))
barplot(sort(table(dt_some_col$absences),decreasing=T))
barplot(sort(table(dt_some_col$activities),decreasing=T))
barplot(sort(table(dt_some_col$alc_use),decreasing=T))
barplot(sort(table(dt_some_col$high_use),decreasing=T))
Let’s boxplot the numerical variables of interest
boxplot(dt_some_col$goout)
boxplot(dt_some_col$absences)
boxplot(dt_some_col$alc_use)
And some more
plot(alc_use ~ goout, data=dt_some_col)
plot(alc_use ~ absences, data=dt_some_col)
Alternatively, this code also does the same: table(dt_some_col$activities, dt_some_col$alc_use)
## Apply logistic regression model
#set binary variables as factor
dt_some_col$internet <- factor(dt_some_col$internet)
dt_some_col$activities <- factor(dt_some_col$activities)
# make the model with glm()
regres_model <- glm(high_use ~ internet + goout + absences + activities, data = dt_some_col, family = "binomial")
# print out a summary of the model
summary(regres_model)
##
## Call:
## glm(formula = high_use ~ internet + goout + absences + activities,
## family = "binomial", data = dt_some_col)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0141 -0.7760 -0.5282 0.9018 2.3876
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.47384 0.51472 -6.749 1.49e-11 ***
## internetyes -0.08489 0.35507 -0.239 0.811041
## goout 0.76811 0.11960 6.422 1.34e-10 ***
## absences 0.06175 0.01701 3.629 0.000284 ***
## activitiesyes -0.44819 0.25086 -1.787 0.074000 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 393.79 on 377 degrees of freedom
## AIC: 403.79
##
## Number of Fisher Scoring iterations: 4
Based on the model summary, absences have a statistically significant relationship with high alcohol use: Pr(>|z|) = 0.000284, so the effect is significant at the 99% level.
Students with more going-out behaviour consumed more alcohol; Pr(>|z|) = 1.34e-10, so this hypothesis is supported at the 99% level.
However, Internet access at home showed no significant relationship with high alcohol use: its Pr(>|z|) is > 0.05, so it is not significant even at the 95% level. Thus our hypothesis that students with Internet at home consume less alcohol is rejected.
This matches our conclusions from the chisq.test results above!
Notably, the relationship between extra-curricular activities and alcohol consumption is significant at the 90% level but not at 95% (p = 0.074), so we reject it at the conventional level and can drop this variable.
Hence, keeping only the significant terms, the fitted model is approximately:
log-odds(high_use) = 0.76811 × goout + 0.06175 × absences − 3.47384, i.e. of the form y = a·x1 + b·x2 + c
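The fitted equation can be applied by hand with plogis(), the inverse-logit function in base R. A sketch for a hypothetical student with goout = 5 and absences = 3 (and internet = no, activities = no, so those dummy terms drop out); the coefficient values are copied from the summary above:

```r
# log-odds for a student who goes out a lot (5) and has 3 absences
log_odds <- -3.47384 + 0.76811 * 5 + 0.06175 * 3
plogis(log_odds)   # inverse logit: probability of high_use, ~0.63
```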
# print out the coefficients of the model
coef(regres_model)
## (Intercept) internetyes goout absences activitiesyes
## -3.47383750 -0.08489077 0.76810971 0.06174629 -0.44819206
# compute odds ratios (OR)
OR <- coef(regres_model) %>% exp
# compute confidence intervals (CI)
CI <- confint(regres_model) %>% exp
## Waiting for profiling to be done...
# bind odds ratios (OR) and confidence intervals (CI) columns together
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03099785 0.01084108 0.08194402
## internetyes 0.91861262 0.46457217 1.88377842
## goout 2.15568752 1.71547363 2.74450680
## absences 1.06369244 1.03125278 1.10286289
## activitiesyes 0.63878199 0.38885454 1.04178158
As the table shows, goout and absences have odds ratios above 1 (2.16 and 1.06), and their 95% confidence intervals exclude 1, so both are reliably and positively associated with high alcohol use. For internetyes and activitiesyes the intervals include 1, consistent with their non-significant coefficients. (Note that an odds ratio is a multiplicative change in odds, not a probability.)
# predict() the probability of high_use
probabilities <- predict(regres_model, type = "response")
# add the predicted probabilities to 'dt_some_col'
dt_some_col <- mutate(dt_some_col, probability = probabilities)
# use the probabilities to make a prediction of high_use
dt_some_col <- mutate(dt_some_col, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(dt_some_col, internet, goout, absences, activities, probability, prediction) %>% tail(10)
## internet goout absences activities probability prediction
## 373 no 2 0 no 0.12590977 FALSE
## 374 no 3 14 no 0.42432092 FALSE
## 375 no 3 2 no 0.25999091 FALSE
## 376 yes 3 7 yes 0.21919441 FALSE
## 377 yes 2 0 yes 0.07793785 FALSE
## 378 yes 4 0 no 0.38076807 FALSE
## 379 no 1 0 yes 0.04093710 FALSE
## 380 no 1 0 yes 0.04093710 FALSE
## 381 yes 5 3 no 0.61468746 TRUE
## 382 yes 1 0 no 0.05783324 FALSE
# tabulate the target variable versus the predictions
table(high_use = dt_some_col$high_use, prediction = dt_some_col$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 23
## TRUE 71 41
Now, the 2x2 cross-tabulation of predictions versus the actual values:
As it shows, 247 low-use cases were correctly predicted as FALSE and 41 high-use cases as TRUE, while 71 high-use cases were misclassified as FALSE (and 23 low-use cases as TRUE).
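From these four counts we can compute the usual summary rates; a quick sketch (counts copied from the table above):

```r
TN <- 247; FP <- 23; FN <- 71; TP <- 41   # from the confusion table above
accuracy    <- (TN + TP) / (TN + FP + FN + TP)
sensitivity <- TP / (TP + FN)             # true-positive rate
specificity <- TN / (TN + FP)             # true-negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Accuracy comes out at about 0.754, i.e. a training error of about 0.246.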
### Extra coding 3: Create a nice graph with confusion matrix
# inspired by https://stackoverflow.com/a/64539733
library('caret')
## Loading required package: lattice
cm <- confusionMatrix(factor(dt_some_col$prediction), factor(dt_some_col$high_use), dnn = c("Prediction", "Reference"))
ggplot(as.data.frame(cm$table), aes(Prediction,sort(Reference,decreasing = T), fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="#009193") +
labs(x = "Prediction",y = "Reference") +
scale_x_discrete(labels=c("False","True")) +
scale_y_discrete(labels=c("True","False"))
Good plot!
As explained above, it shows that 247 low-use cases were correctly predicted as FALSE, while 71 high-use cases were misclassified.
# initialize a plot of 'high_use' versus 'probability' in 'dt_some_col'
g <- ggplot(dt_some_col, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = dt_some_col$high_use, prediction = dt_some_col$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64659686 0.06020942 0.70680628
## TRUE 0.18586387 0.10732984 0.29319372
## Sum 0.83246073 0.16753927 1.00000000
Define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dt_some_col$high_use, prob = dt_some_col$probability)
## [1] 0.2460733
Perform K-fold cross-validation and print out the mean prediction error for the testing data. It divides the observations into K folds, holds one fold out for validation, and uses the rest for training; for example, with K = 5, 4 folds are used for training and 1 fold for validation.
#load library
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
# cv.glm function from the 'boot' library computes the error and stores it in delta
cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = 2)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2565445
k = c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25)
for (i in k){
print(i)
cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = i)
# average number of wrong predictions in the cross validation
print (cv$delta[1])}
## [1] 2
## [1] 0.2486911
## [1] 3
## [1] 0.2591623
## [1] 4
## [1] 0.2565445
## [1] 5
## [1] 0.2565445
## [1] 6
## [1] 0.2617801
## [1] 7
## [1] 0.2513089
## [1] 8
## [1] 0.2513089
## [1] 9
## [1] 0.2513089
## [1] 10
## [1] 0.2617801
## [1] 11
## [1] 0.2670157
## [1] 12
## [1] 0.2539267
## [1] 13
## [1] 0.2460733
## [1] 14
## [1] 0.2591623
## [1] 15
## [1] 0.2591623
## [1] 20
## [1] 0.2643979
## [1] 25
## [1] 0.2486911
Looping over different values of K (above) does not change the fitted model itself, but it shows that the cross-validated estimate of the prediction error is fairly stable, fluctuating between roughly 0.246 and 0.267 wrong predictions on average (cv$delta[1]).
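Note that cv.glm() assigns observations to folds at random, so repeated runs give slightly different errors; fixing the random seed makes a run reproducible. A self-contained sketch on the built-in mtcars data (the model and loss function here are stand-ins, not the ones above):

```r
library(boot)

# toy logistic model: transmission type predicted by fuel economy
fit  <- glm(am ~ mpg, data = mtcars, family = "binomial")
loss <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(123)                 # reproducible fold assignment
cv <- cv.glm(data = mtcars, cost = loss, glmfit = fit, K = 5)
cv$delta[1]                   # cross-validated error estimate
```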
## Super-Bonus: Create logistic regression with other variables
I would like to add two more variables: romantic (whether or not the student is in a romantic relationship) and famrel (the quality of family relationships).
I keep the previously significant variables.
colnames(data_student_perf_alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
variables <- c("romantic", "goout", "absences", "famrel",
"alc_use", "high_use")
dt_some_col <- select(data_student_perf_alc, variables)
dim(dt_some_col)
## [1] 382 6
# draw a bar plot of each variable
gather(dt_some_col) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
table(dt_some_col$absences)/nrow(dt_some_col)*100
##
## 0 1 2 3 4 5 6
## 30.3664921 0.7853403 17.5392670 1.5706806 13.3507853 1.0471204 7.8534031
## 7 8 9 10 11 12 13
## 2.0942408 5.4973822 0.7853403 4.4502618 0.5235602 2.6178010 0.7853403
## 14 15 16 17 18 19 20
## 2.8795812 0.7853403 1.5706806 0.2617801 1.0471204 0.2617801 0.5235602
## 21 22 23 24 25 26 28
## 0.2617801 0.7853403 0.2617801 0.2617801 0.2617801 0.2617801 0.2617801
## 30 54 56 75
## 0.2617801 0.2617801 0.2617801 0.2617801
barplot(sort(table(dt_some_col$romantic),decreasing=T))
barplot(sort(table(dt_some_col$goout),decreasing=T))
barplot(sort(table(dt_some_col$absences),decreasing=T))
barplot(sort(table(dt_some_col$famrel),decreasing=T))
barplot(sort(table(dt_some_col$alc_use),decreasing=T))
barplot(sort(table(dt_some_col$high_use),decreasing=T))
#set binary variables as factor
dt_some_col$romantic <- factor(dt_some_col$romantic)
# fit the model with glm()
regres_model <- glm(high_use ~ romantic + goout + absences + famrel, data = dt_some_col, family = "binomial")
# print out a summary of the model
summary(regres_model)
##
## Call:
## glm(formula = high_use ~ romantic + goout + absences + famrel,
## family = "binomial", data = dt_some_col)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9599 -0.7789 -0.5080 0.8454 2.4681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.15491 0.63232 -3.408 0.000655 ***
## romanticyes -0.37335 0.27759 -1.345 0.178641
## goout 0.79978 0.12171 6.571 4.99e-11 ***
## absences 0.06001 0.01641 3.656 0.000256 ***
## famrel -0.41049 0.13511 -3.038 0.002380 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 386.46 on 377 degrees of freedom
## AIC: 396.46
##
## Number of Fisher Scoring iterations: 4
Based on the model summary, the quality of family relationships (famrel) has a statistically significant relationship with high alcohol use: Pr(>|z|) = 0.002380, significant at the 99% level.
Students with more going-out behaviour consumed more alcohol; Pr(>|z|) = 4.99e-11, so this hypothesis is supported at the 99% level.
Thus, the fitted model is approximately:
log-odds(high_use) = 0.79978 × goout + 0.06001 × absences − 0.37335 × romanticyes − 0.41049 × famrel − 2.15491
# print out the coefficients of the model
coef(regres_model)
## (Intercept) romanticyes goout absences famrel
## -2.15490562 -0.37335072 0.79978011 0.06000845 -0.41048560
# compute odds ratios (OR)
OR <- coef(regres_model) %>% exp
# compute confidence intervals (CI)
CI <- confint(regres_model) %>% exp
## Waiting for profiling to be done...
# bind odds ratios (OR) and confidence intervals (CI) columns together
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1159141 0.03245399 0.3900847
## romanticyes 0.6884237 0.39504416 1.1763720
## goout 2.2250516 1.76429361 2.8462267
## absences 1.0618455 1.03000291 1.0997256
## famrel 0.6633281 0.50735265 0.8632692
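An odds ratio below 1 shrinks the odds: for famrel, each one-unit improvement in family-relationship quality multiplies the odds of high alcohol use by about 0.66, i.e. reduces them by roughly a third. A quick check from the coefficient above:

```r
# famrel coefficient copied from the model summary above
OR_famrel <- exp(-0.41048560)
OR_famrel                 # ~0.66
(OR_famrel - 1) * 100     # ~ -33.7% change in odds per one-unit increase
```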
# predict() the probability of high_use
probabilities <- predict(regres_model, type = "response")
# add the predicted probabilities to 'dt_some_col'
dt_some_col <- mutate(dt_some_col, probability = probabilities)
# use the probabilities to make a prediction of high_use
dt_some_col <- mutate(dt_some_col, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(dt_some_col, romantic , goout , absences , famrel, probability, prediction) %>% tail(10)
## romantic goout absences famrel probability prediction
## 373 no 2 0 4 0.09999431 FALSE
## 374 no 3 14 5 0.27530426 FALSE
## 375 no 3 2 5 0.15604215 FALSE
## 376 yes 3 7 4 0.20573973 FALSE
## 377 no 2 0 5 0.06863981 FALSE
## 378 no 4 0 4 0.35486376 FALSE
## 379 no 1 0 1 0.14608898 FALSE
## 380 no 1 0 1 0.14608898 FALSE
## 381 no 5 3 2 0.76906675 TRUE
## 382 no 1 0 4 0.04755851 FALSE
# tabulate the target variable versus the predictions
table(high_use = dt_some_col$high_use, prediction = dt_some_col$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 251 19
## TRUE 66 46
As shown above, 251 low-use cases were correctly predicted as FALSE, while 66 high-use cases were misclassified, so the accuracy improved.
Moreover, the diagonal counts (251 and 46) increased compared to the previous model, which shows an increase in correctly predicted values and hence in model accuracy.
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dt_some_col$high_use, prob = dt_some_col$probability)
## [1] 0.2225131
K-fold cross-validation
As explained above, let’s perform K-fold cross-validation and print out the mean prediction error for the testing data.
It divides the observations into K folds, holds one fold out for validation, and uses the rest for training; for example, with K = 5, 4 folds are used for training and 1 fold for validation.
#load library
library(boot)
# cv.glm function from the 'boot' library computes the error and stores it in delta
cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = 2)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2277487
k = c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25)
for (i in k){
print(i)
cv <- cv.glm(data = dt_some_col, cost = loss_func, glmfit = regres_model, K = i)
# average number of wrong predictions in the cross validation
print (cv$delta[1])}
## [1] 2
## [1] 0.2303665
## [1] 3
## [1] 0.2408377
## [1] 4
## [1] 0.2225131
## [1] 5
## [1] 0.2251309
## [1] 6
## [1] 0.2303665
## [1] 7
## [1] 0.2303665
## [1] 8
## [1] 0.2277487
## [1] 9
## [1] 0.2408377
## [1] 10
## [1] 0.2303665
## [1] 11
## [1] 0.2146597
## [1] 12
## [1] 0.2251309
## [1] 13
## [1] 0.2198953
## [1] 14
## [1] 0.2277487
## [1] 15
## [1] 0.2225131
## [1] 20
## [1] 0.2251309
## [1] 25
## [1] 0.2251309
So, the cross-validated accuracy improved by adding the new variable (quality of family relationships), because the average number of wrong predictions (cv$delta[1] above) decreased.
#load the required library
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#load the data
data("Boston")
Data description:
This dataset is about housing values in suburbs of Boston. More info can be found in: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
#explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The summary shows the minimum, mean, median, maximum, and the first and third quartiles for each column of the dataset.
# check dimensions of the data
dim(Boston)
## [1] 506 14
As it shows, this dataset has 506 rows (observations) and 14 columns (variables)
# see the head (first 6 rows) of the data
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# see the tail (last 6 rows) of the data
tail(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 501 0.22438 0 9.69 0 0.585 6.027 79.7 2.4982 6 391 19.2 396.90 14.33
## 502 0.06263 0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67
## 503 0.04527 0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08
## 504 0.06076 0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64
## 505 0.10959 0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48
## 506 0.04741 0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88
## medv
## 501 16.8
## 502 22.4
## 503 20.6
## 504 23.9
## 505 22.0
## 506 11.9
#see column names
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
# plot matrix of the variables
pairs(Boston)
You can see pairwise plots of each variable (column) against every other, and how they are distributed.
library(ggplot2)
library(GGally)
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p
Some more plots that show the correlation of each column (variable) with the others, along with their distributions.
Let’s also make a boxplot to visually see the distribution, minimum, maximum, quartiles and median of each variable.
boxplot(Boston,
main = "Boxplot of all columns of Boston data",
xlab = "variable name",
ylab = "",
col = "orange",
border = "brown",
horizontal = F,
notch = F)
As the graph shows, the column “tax” has the highest variation, followed by “black”; the values in the other columns vary much less!
Let’s look at the structure of the data even better using glimpse().
# read libraries
library(tidyr); library(dplyr); library(ggplot2)
# glimpse at the Boston data
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(Boston) %>% glimpse
## Rows: 7,084
## Columns: 2
## $ key <chr> "crim", "crim", "crim", "crim", "crim", "crim", "crim", "crim...
## $ value <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829...
# draw a bar plot of each variable
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
The graph shows a bar plot of each column's distribution.
Let’s investigate possible correlations within the data: calculate the correlation matrix and round it.
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
# cor_matrix # commented out to shorten the report
Inspired by http://www.htmlwidgets.org/showcase_datatables.html
library('DT')
datatable(cor_matrix, options = list(pageLength = 10))
Alternatively, set the page length to the number of rows of the input data frame: datatable(cor_matrix, options = list(pageLength = nrow(cor_matrix)))
visualize the correlation matrix
library("corrplot")
## corrplot 0.84 loaded
corrplot(cor_matrix, method="circle", type="upper",
cl.pos="b", tl.pos="d", tl.cex = 0.7)
The graph shows the correlation between variables: the larger the circle (and the darker the color), the stronger the correlation, and the color shows whether the correlation is positive or negative. For example, the correlation between nox and dis is strong and negative, i.e. as one variable increases, the other decreases. Another example is rad and tax, which have a strong positive correlation.
On the other hand, chas does not correlate strongly with any other variable.
Task 4: Let’s standardize the variables using the scale() function.
boston_scaled <- scale(Boston)
With center = TRUE (the default), scale() centers each column by subtracting the column mean (omitting NAs) from its values, and with scale = TRUE it then divides each centered column by its standard deviation.
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
So, the variables were standardized as explained above: the mean of each column was shifted to 0 and the other values were scaled accordingly. Hence, the mean of every column is now 0.
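As a sanity check, the standardization that scale() performs can be reproduced by hand on a made-up toy vector (not the Boston data): subtract the mean, then divide by the standard deviation.

```r
# Manual standardization of a toy vector; should match scale().
x <- c(2, 4, 6, 8)
x_manual <- (x - mean(x)) / sd(x)
all.equal(x_manual, as.vector(scale(x)))  # TRUE
mean(x_manual)            # 0
round(sd(x_manual), 10)   # 1
```

This is exactly why every column of boston_scaled summarizes to mean 0 above.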
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
As you can see, the object's class is “matrix”/“array”, so we will change it to a data frame.
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
#head(boston_scaled)
boxplot(boston_scaled,
main = "Boxplot of all columns of scaled Boston data",
xlab = "variable name",
ylab = "",
col = "orange",
border = "brown",
horizontal = F,
notch = F)
So, now the distribution of data is shown better. Yaaay.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
As instructed, we shall now use the quantile vector of the crim column to create a categorical crime-rate variable in the dataset.
So, let’s start.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
You can see the quantiles of the crim column; for example, the minimum is -0.419366929 and the first quartile is -0.410563278.
Now we can categorise the continuous values based on these quantiles of the data (as instructed).
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
cut() divides the range of x into intervals and codes the values in x according to which interval they fall into.
Now we have labelled the crime rate as “low”, “med_low”, “med_high” or “high” based on the quantile values.
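To see cut() in isolation, here is a toy example (with made-up values, not crim) that bins a vector at its own quartiles, mirroring the step above:

```r
# Bin a toy vector at its quantile breaks, as done for crim above.
v <- 1:8
b <- quantile(v)  # 0%, 25%, 50%, 75%, 100%
cut(v, breaks = b, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
# low low med_low med_low med_high med_high high high
```

include.lowest = TRUE makes sure the minimum value itself falls into the first bin instead of becoming NA.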
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Now you can see the number of observations in each category: low 127, med_low 126, med_high 126, and high 127.
#we can even plot them
plot(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows as train dataset
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set by selecting those observations (rows) that were not in the training data (using -ind)
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
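One caveat: sample() draws the 80% of rows at random, so the split (and the accuracy later on) changes on every knit. A hedged sketch of making it reproducible with a seed (the seed value 123 and the toy n are my own choices, not from the original code):

```r
# Fixing the RNG seed before sample() makes the 80/20
# train/test split reproducible between runs.
set.seed(123)
n <- 10                                # stand-in for nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train_rows <- ind                      # 80% of the rows
test_rows  <- setdiff(1:n, ind)        # the remaining 20%
length(train_rows)  # 8
length(test_rows)   # 2
```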
Task 5: Fit a linear discriminant analysis model on the train set.
lda.fit <- lda(crime ~ ., data = train)
This uses all variables (columns) of the train dataset to predict crime (the categorised column).
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2673267 0.2475248 0.2574257 0.2277228
##
## Group means:
## zn indus chas nox rm age
## low 1.0922662 -0.9473815 -0.12651054 -0.8879308 0.4484884 -0.8868819
## med_low -0.1000596 -0.3622815 0.04263895 -0.5728114 -0.1460886 -0.3064032
## med_high -0.3783982 0.1720239 0.18195173 0.3798530 0.1028986 0.4116884
## high -0.4872402 1.0173395 -0.01556166 1.0381126 -0.3359546 0.7961204
## dis rad tax ptratio black lstat
## low 0.9393971 -0.6915643 -0.7467095 -0.44051074 0.37341910 -0.778052225
## med_low 0.3776243 -0.5534930 -0.5745746 -0.07553705 0.31152649 -0.100148030
## med_high -0.3860206 -0.4120549 -0.3134247 -0.28547305 0.08316151 -0.001425366
## high -0.8476244 1.6358846 1.5124511 0.77816452 -0.90450477 0.823739094
## medv
## low 0.530216130
## med_low 0.008067069
## med_high 0.167473433
## high -0.637627671
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06602024 0.77054979 -0.8819132
## indus 0.04692297 -0.23020986 0.3119031
## chas -0.06229522 -0.07756859 0.1117917
## nox 0.33026807 -0.54054089 -1.3476206
## rm -0.08774337 -0.08988796 -0.1739879
## age 0.20495695 -0.25903560 -0.1891056
## dis -0.09345475 -0.17753200 0.2195437
## rad 2.95045098 1.19344627 0.5035884
## tax 0.19302578 -0.35188594 -0.0436914
## ptratio 0.11049391 0.06184702 -0.2927775
## black -0.12413626 -0.02283732 0.1089013
## lstat 0.24231237 -0.27627984 0.4346493
## medv 0.19093873 -0.37447541 -0.1820459
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9397 0.0453 0.0150
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Predict:
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 7 11 1 0
## med_low 5 11 10 0
## med_high 0 8 13 1
## high 0 0 0 35
Inspired by https://stackoverflow.com/a/64539733
library('caret')
cm <- confusionMatrix(factor(lda.pred$class), factor(correct_classes), dnn = c("Prediction", "Reference"))
ggplot(as.data.frame(cm$table), aes(Prediction,sort(Reference,decreasing = T), fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="#009193") +
labs(x = "Reference",y = "Prediction") +
scale_x_discrete(labels=c('low','med_low','med_high','high')) +
scale_y_discrete(labels=c('high', 'med_high', 'med_low', 'low'))
Overall accuracy is the proportion of correctly classified labels out of all predictions.
Code inspired by https://stackoverflow.com/a/24349171
all_accuracies <- cm$overall
all_accuracies
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.470588e-01 5.173502e-01 5.461983e-01 7.391350e-01 3.431373e-01
## AccuracyPValue McnemarPValue
## 3.863064e-10 NaN
all_accuracies['Accuracy']*100
## Accuracy
## 64.70588
all_accuracies['Kappa']*100
## Kappa
## 51.73502
The confusion matrix (graph) shows, for example, that all 35 cases of class high were predicted correctly as high, while one med_high case was misclassified as high.
As the confusion matrix shows, most of the errors occur between the neighbouring low, med_low and med_high classes.
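The accuracy reported by caret can also be computed directly from the cross-tabulation: correct predictions sit on the diagonal. A small check using the counts from the table above:

```r
# Rows = correct class, columns = predicted class,
# in the order low, med_low, med_high, high (from the table above).
tab <- matrix(c( 7, 11,  1,  0,
                 5, 11, 10,  0,
                 0,  8, 13,  1,
                 0,  0,  0, 35),
              nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # 66 / 102 = 0.647..., matching caret's 64.7 %
```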
Task 7: Reload the dataset and calculate the distances between the observations.
# load MASS and Boston
library(MASS)
data('Boston')
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
As you can see, the mean Manhattan distance between observations is 342.899 (with 2.02 and 1198.27 as min and max, respectively).
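To see what the two metrics measure, here is a toy example with two points, (3, 4) and (0, 0):

```r
# Euclidean is the straight-line distance; Manhattan is the
# sum of absolute coordinate differences.
m <- rbind(c(3, 4), c(0, 0))
dist(m)                        # sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # |3| + |4| = 7
```

Manhattan distances are always at least as large as Euclidean ones, which is why the summary values above are bigger for dist_man.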
Let’s cluster the dataset into 3 clusters, so that homogeneity within a cluster is greater than between clusters. Clusters are identified by assessing the relative distances between points; in this example, the relative homogeneity of each cluster and the degree of separation between the clusters make the task fairly simple (source: Multivariate Analysis for the Behavioral Sciences).
km <-kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
These pairwise plots show how well the variables separate the observations when the dataset is clustered into 3 clusters.
For example, the tax and rad, tax and black, or tax and lstat columns were visually cluster-able. Interestingly, these columns also showed high correlations with each other.
As another example, the correlation between tax and rad is 0.91 (see below), so I plot the correlation matrix again. FYI: we can plot the data in 3D to see this even better; for more info, check the bonus exercises.
cor(Boston$tax, Boston$rad)
## [1] 0.9102282
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper",
cl.pos="b", tl.pos="d", tl.cex = 0.7)
# Investigate the optimal number of clusters
set.seed(1)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line', xlab= "number of clusters", ylab="Within groups sum of squares")
As the figure shows, the total within-cluster sum of squares declines sharply as the number of clusters K increases, but after about 5 it stays almost steady, so around 4 or 5 clusters is enough.
# k-means clustering
km <-kmeans(Boston, centers = 2) #number of clusters
km
## K-means clustering with 2 clusters of sizes 369, 137
##
## Cluster means:
## crim zn indus chas nox rm age dis
## 1 0.3887744 15.58266 8.420894 0.07317073 0.5118474 6.388005 60.63225 4.441272
## 2 12.2991617 0.00000 18.451825 0.05839416 0.6701022 6.006212 89.96788 2.054470
## rad tax ptratio black lstat medv
## 1 4.455285 311.9268 17.80921 381.0426 10.41745 24.85718
## 2 23.270073 667.6423 20.19635 291.0391 18.67453 16.27226
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
## 501 502 503 504 505 506
## 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 2868770 2896224
## (between_SS / total_SS = 70.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Here I analyse different numbers of clusters by changing the centers argument, and report the between-cluster share of the total sum of squares below.
Looking at the kmeans output, I understood that the reported percentage is between_SS divided by total_SS. So let’s write that into a loop and print it at each iteration.
centers_list = c(2, 3,4,5,6,7,8,9,10,15,20)
for (i in centers_list){
print('.........')
print(i)
km <-kmeans(Boston, centers = i)
#print(km)
print(km$betweenss/km$totss*100)
}
## [1] "........."
## [1] 2
## [1] 70.28516
## [1] "........."
## [1] 3
## [1] 84.18386
## [1] "........."
## [1] 4
## [1] 90.64774
## [1] "........."
## [1] 5
## [1] 79.60841
## [1] "........."
## [1] 6
## [1] 93.57224
## [1] "........."
## [1] 7
## [1] 94.17725
## [1] "........."
## [1] 8
## [1] 82.01612
## [1] "........."
## [1] 9
## [1] 95.79905
## [1] "........."
## [1] 10
## [1] 82.72624
## [1] "........."
## [1] 15
## [1] 97.35675
## [1] "........."
## [1] 20
## [1] 97.86162
The numbers above are the number of clusters and the corresponding between_SS / total_SS percentage.
Interpretation: Although the ratio improves as the number of clusters increases, 4 clusters seem fine, because the improvement is dramatic up to that point and then levels off.
Notably, I think this depends on the application and aim of each project. For example, in my personal experience of forest mapping using aerial imagery, increasing the number of clusters often led to horrible maps, so somewhere between 3 and 5 is good.
Thus, as explained above, we can use 4 clusters to keep this exercise manageable and not too complicated. I run the clustering again to save it in an R object.
km <-kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
As the figure shows, and as explained above, some column pairs like tax and black, or tax and dis, are easily separable.
Inspired by https://www.r-bloggers.com/2020/05/practical-guide-to-k-means-clustering/. This can be considered an alternative approach.
wssplot <- function(Boston, nc=15, seed=1234){
wss <- (nrow(Boston)-1)*sum(apply(Boston,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(Boston, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
# plotting values for each cluster starting from 1 to 9
wssplot(Boston, nc = 9)
Interpretation: As the plot shows, the total within-cluster sum of squares declines as the number of clusters increases, but levels off after 5. Thus, selecting 4 or 5 clusters sounds optimal.
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
my_data <- scale(Boston)
clus_plot = fviz_nbclust(my_data, kmeans, method = "wss") #method = "wss" is total within sum of square
clus_plot
The fviz_nbclust function determines and visualizes the optimal number of clusters using different methods; here, the total within-cluster sum of squares.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
library("plotly")
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library("colorspace")
#red <- hex(HLS(0, 0.5, 1))
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
z = matrix_product$LD3, type= 'scatter3d', mode='markers',
color = train$crime, colors = c("blue", "yellow", "black", "red"))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Please feel free to zoom in and out on the 3D figure :) As you can see, the 4 classes can be distinguished by zooming into the 3D figure above.
require(plot3D)
## Loading required package: plot3D
attach(mtcars)
## The following object is masked from package:ggplot2:
##
## mpg
scatter3D(x = matrix_product$LD1, y = matrix_product$LD2,
z = matrix_product$LD3, pch = 18, cex = 2,
theta = 15, phi = 20, # changing theta and phi changes the viewing angle of the 3D plot
ticktype = "detailed",
xlab = "LD1", ylab = "LD2", zlab = "LD3", clab = "",
colkey = list(length = 0.8, width = 0.4),
main = "3D scatter of clusters", col=c("blue", "yellow", "black", "red"))
Technical note: by changing theta and phi, you change your viewing angle on the 3D plot.
From this view, we can clearly discriminate two clusters (one on the right side and one on the left); however, if we could rotate the plot, we could look at it from other sides and say more about the 4 possible clusters visually. I made that possible in the extra code below.
Inspired by http://www.sthda.com/english/wiki/impressive-package-for-3d-and-4d-graph-r-software-and-data-visualization
# Make the rgl version
library("plot3Drgl")
## Loading required package: rgl
plotrgl()
Please feel free to zoom in and out on the 3D figure :)
Technical note: run your 3D plot first, then run this code to get an interactive, zoomable version of it.
We can see that it is quite easy to pick out 3-4 clusters visually.
FYI: you can even plot a histogram of all columns in 3D and zoom into it: hist3D(z = as.matrix(test))
# plot the lda results
plot(lda.fit, dimen = 3, col = classes, pch = classes)
So, we can see that it is very easy to distinguish 2 clusters visually, and even 4 is possible.
inspired from https://www.r-bloggers.com/2020/05/practical-guide-to-k-means-clustering/
#Related to bonus 2: coloring
library(factoextra)
fviz_cluster(km, data = Boston, main = "Partitioning Clustering Plot")
Technical note: fviz_cluster provides an elegant ggplot2-based visualization of partitioning methods, including k-means.
Since the graph is 2D, we cannot rotate it or zoom to see other views of the graph. But generally, it looks good!
options(knitr.duplicate.label = "allow")
#to solve the following error (mentioned below), I added this chunk, which resolved it:
#error: Error in parse_block(g[-1], g[1], params.src, markdown_mode) : Duplicate chunk label 'setup', which has been used for the chunk: knitr::opts_chunk$set(echo = TRUE) Calls: <Anonymous> ... process_file -> split_file -> lapply -> FUN -> parse_block
#inspired by: https://bookdown.org/yihui/rmarkdown-cookbook/duplicate-label.html
My GitHub repository is https://github.com/Imangholiloo/IODS-project
Nowadays, it’s common to have huge datasets (big data) in which multiple variables correlate strongly with each other, or which contain noise; that is, one variable can largely explain another. Thus, it is good to reduce the high dimensionality of our data.
There are many available methods, but we will learn about principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that together capture as much of the variance of the original variables as possible. We will also look at multiple correspondence analysis (MCA), which gives similar possibilities in the world of discrete variables, even nominal (classified) ones, by finding a suitable transformation into continuous scales and then reducing the dimensions much as PCA does. More info: https://vimeo.com/204164956
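As a minimal preview of PCA on made-up data (not the course datasets), prcomp() with scaling shows how two strongly correlated columns collapse into one dominant component:

```r
# Toy PCA: columns a and b are almost perfectly correlated,
# so PC1 should capture most of the total variance.
set.seed(1)
toy <- data.frame(a = 1:20,
                  b = (1:20) * 2 + rnorm(20, sd = 0.1),
                  c = rnorm(20))
pca <- prcomp(toy, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]
```

The proportions sum to 1, and the first component carries well over half of the variance because a and b are nearly redundant.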
#Read the “Human development” file into R
hd <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human_development.csv", stringsAsFactors = F)
#Read the “Gender inequality” file into R
gii <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/gender_inequality.csv", stringsAsFactors = F, na.strings = "..")
For data description and more info please visit: http://hdr.undp.org/en/content/human-development-index-hdi and http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf
#structure of the data
str(hd)
## 'data.frame': 195 obs. of 8 variables:
## $ HDI.Rank : int 1 2 3 4 5 6 6 8 9 9 ...
## $ Country : chr "Norway" "Australia" "Switzerland" "Denmark" ...
## $ Human.Development.Index..HDI. : num 0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
## $ Life.Expectancy.at.Birth : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Expected.Years.of.Education : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Mean.Years.of.Education : num 12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
## $ Gross.National.Income..GNI..per.Capita: chr "64,992" "42,261" "56,431" "44,025" ...
## $ GNI.per.Capita.Rank.Minus.HDI.Rank : int 5 17 6 11 9 11 16 3 11 23 ...
#dimensions of the data (rows * columns)
dim(hd)
## [1] 195 8
#summary of file
summary(hd)
## HDI.Rank Country Human.Development.Index..HDI.
## Min. : 1.00 Length:195 Min. :0.3480
## 1st Qu.: 47.75 Class :character 1st Qu.:0.5770
## Median : 94.00 Mode :character Median :0.7210
## Mean : 94.31 Mean :0.6918
## 3rd Qu.:141.25 3rd Qu.:0.8000
## Max. :188.00 Max. :0.9440
## NA's :7
## Life.Expectancy.at.Birth Expected.Years.of.Education Mean.Years.of.Education
## Min. :49.00 Min. : 4.10 Min. : 1.400
## 1st Qu.:65.75 1st Qu.:11.10 1st Qu.: 5.550
## Median :73.10 Median :13.10 Median : 8.400
## Mean :71.07 Mean :12.86 Mean : 8.079
## 3rd Qu.:76.80 3rd Qu.:14.90 3rd Qu.:10.600
## Max. :84.00 Max. :20.20 Max. :13.100
##
## Gross.National.Income..GNI..per.Capita GNI.per.Capita.Rank.Minus.HDI.Rank
## Length:195 Min. :-84.0000
## Class :character 1st Qu.: -9.0000
## Mode :character Median : 1.5000
## Mean : 0.1862
## 3rd Qu.: 11.0000
## Max. : 47.0000
## NA's :7
As you can see, all our columns are numerical, except Country and Gross.National.Income..GNI..per.Capita, which are character.
#column names
colnames(hd)
## [1] "HDI.Rank"
## [2] "Country"
## [3] "Human.Development.Index..HDI."
## [4] "Life.Expectancy.at.Birth"
## [5] "Expected.Years.of.Education"
## [6] "Mean.Years.of.Education"
## [7] "Gross.National.Income..GNI..per.Capita"
## [8] "GNI.per.Capita.Rank.Minus.HDI.Rank"
As you can see, the column names are long, so in the next phase we will shorten them.
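A hedged sketch of what that renaming step could look like (the short names HDI and Life.Exp are my own suggestions, not fixed by the course material):

```r
# Shorten unwieldy column names by assigning a new colnames vector.
df <- data.frame(Human.Development.Index..HDI. = 0.944,
                 Life.Expectancy.at.Birth = 81.6)
colnames(df) <- c("HDI", "Life.Exp")
colnames(df)  # "HDI" "Life.Exp"
```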
#see head and tail of the data
# head(hd) # commented out to shorten the report
# tail(hd)
#Let's do the same for the gii dataset
#structure of the data
str(gii)
## 'data.frame': 195 obs. of 10 variables:
## $ GII.Rank : int 1 2 3 4 5 6 6 8 9 9 ...
## $ Country : chr "Norway" "Australia" "Switzerland" "Denmark" ...
## $ Gender.Inequality.Index..GII. : num 0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
## $ Maternal.Mortality.Ratio : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Adolescent.Birth.Rate : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Percent.Representation.in.Parliament : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ Population.with.Secondary.Education..Female.: num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Population.with.Secondary.Education..Male. : num 96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
## $ Labour.Force.Participation.Rate..Female. : num 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
## $ Labour.Force.Participation.Rate..Male. : num 68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
#dimensions of the data (rows x columns)
dim(gii)
## [1] 195 10
#summary of file
summary(gii)
## GII.Rank Country Gender.Inequality.Index..GII.
## Min. : 1.00 Length:195 Min. :0.0160
## 1st Qu.: 47.75 Class :character 1st Qu.:0.2030
## Median : 94.00 Mode :character Median :0.3935
## Mean : 94.31 Mean :0.3695
## 3rd Qu.:141.25 3rd Qu.:0.5272
## Max. :188.00 Max. :0.7440
## NA's :7 NA's :33
## Maternal.Mortality.Ratio Adolescent.Birth.Rate
## Min. : 1.0 Min. : 0.60
## 1st Qu.: 16.0 1st Qu.: 15.45
## Median : 69.0 Median : 40.95
## Mean : 163.2 Mean : 49.55
## 3rd Qu.: 230.0 3rd Qu.: 71.78
## Max. :1100.0 Max. :204.80
## NA's :10 NA's :5
## Percent.Representation.in.Parliament
## Min. : 0.00
## 1st Qu.:12.47
## Median :19.50
## Mean :20.60
## 3rd Qu.:27.02
## Max. :57.50
## NA's :3
## Population.with.Secondary.Education..Female.
## Min. : 0.9
## 1st Qu.: 27.8
## Median : 55.7
## Mean : 54.8
## 3rd Qu.: 81.8
## Max. :100.0
## NA's :26
## Population.with.Secondary.Education..Male.
## Min. : 3.20
## 1st Qu.: 38.30
## Median : 60.00
## Mean : 60.29
## 3rd Qu.: 85.80
## Max. :100.00
## NA's :26
## Labour.Force.Participation.Rate..Female.
## Min. :13.50
## 1st Qu.:44.50
## Median :53.30
## Mean :52.61
## 3rd Qu.:62.62
## Max. :88.10
## NA's :11
## Labour.Force.Participation.Rate..Male.
## Min. :44.20
## 1st Qu.:68.88
## Median :75.55
## Mean :74.74
## 3rd Qu.:80.15
## Max. :95.50
## NA's :11
As the summary shows, all variables except Country are numerical.
#column names
colnames(gii)
## [1] "GII.Rank"
## [2] "Country"
## [3] "Gender.Inequality.Index..GII."
## [4] "Maternal.Mortality.Ratio"
## [5] "Adolescent.Birth.Rate"
## [6] "Percent.Representation.in.Parliament"
## [7] "Population.with.Secondary.Education..Female."
## [8] "Population.with.Secondary.Education..Male."
## [9] "Labour.Force.Participation.Rate..Female."
## [10] "Labour.Force.Participation.Rate..Male."
#see head and tail of the data
'head(gii)' # I comment these lines out to shorten the report
## [1] "head(gii)"
'tail(gii)'
## [1] "tail(gii)"
Task 4: Rename the variables with (shorter) descriptive names
colnames(hd)
## [1] "HDI.Rank"
## [2] "Country"
## [3] "Human.Development.Index..HDI."
## [4] "Life.Expectancy.at.Birth"
## [5] "Expected.Years.of.Education"
## [6] "Mean.Years.of.Education"
## [7] "Gross.National.Income..GNI..per.Capita"
## [8] "GNI.per.Capita.Rank.Minus.HDI.Rank"
colnames(hd) <- c('HDI_Rank', 'Country', 'HDI', 'Life_Expec_Birth',
'Expec_yr_Edu', 'Mean_yr_Edu', 'GNI_per_Cap',
'GNI_per_Cap_Min_HDI_Rank')
colnames(hd)
## [1] "HDI_Rank" "Country"
## [3] "HDI" "Life_Expec_Birth"
## [5] "Expec_yr_Edu" "Mean_yr_Edu"
## [7] "GNI_per_Cap" "GNI_per_Cap_Min_HDI_Rank"
So you can see that the column names were shortened.
Let’s do the same for gii dataset
colnames(gii)
## [1] "GII.Rank"
## [2] "Country"
## [3] "Gender.Inequality.Index..GII."
## [4] "Maternal.Mortality.Ratio"
## [5] "Adolescent.Birth.Rate"
## [6] "Percent.Representation.in.Parliament"
## [7] "Population.with.Secondary.Education..Female."
## [8] "Population.with.Secondary.Education..Male."
## [9] "Labour.Force.Participation.Rate..Female."
## [10] "Labour.Force.Participation.Rate..Male."
colnames(gii) <- c('GII_Rank', 'Country', 'GII', 'Mortality_R',
'Adolescent_Birth_R', 'Representation_Parliament',
'Sec_Edu_Fem','Sec_Edu_Mal',
"Lab_Forc_Particip_R_Fem",
"Lab_Forc_Particip_R_Mal" )
colnames(gii)
## [1] "GII_Rank" "Country"
## [3] "GII" "Mortality_R"
## [5] "Adolescent_Birth_R" "Representation_Parliament"
## [7] "Sec_Edu_Fem" "Sec_Edu_Mal"
## [9] "Lab_Forc_Particip_R_Fem" "Lab_Forc_Particip_R_Mal"
So you can see that the column names were shortened.
Task 5: Mutate the “Gender inequality” data and create two new variables
library(dplyr)
gii <- mutate(gii, edu_R = (Sec_Edu_Fem/Sec_Edu_Mal))
head(gii)
## GII_Rank Country GII Mortality_R Adolescent_Birth_R
## 1 1 Norway 0.067 4 7.8
## 2 2 Australia 0.110 6 12.1
## 3 3 Switzerland 0.028 6 1.9
## 4 4 Denmark 0.048 5 5.1
## 5 5 Netherlands 0.062 6 6.2
## 6 6 Germany 0.041 7 3.8
## Representation_Parliament Sec_Edu_Fem Sec_Edu_Mal Lab_Forc_Particip_R_Fem
## 1 39.6 97.4 96.7 61.2
## 2 30.5 94.3 94.6 58.8
## 3 28.5 95.0 96.6 61.8
## 4 38.0 95.5 96.6 58.7
## 5 36.9 87.7 90.5 58.5
## 6 36.9 96.3 97.0 53.6
## Lab_Forc_Particip_R_Mal edu_R
## 1 68.7 1.0072389
## 2 71.8 0.9968288
## 3 74.9 0.9834369
## 4 66.4 0.9886128
## 5 70.6 0.9690608
## 6 66.4 0.9927835
Instead of using mutate, this is also possible:
gii$edu_R <- gii$Sec_Edu_Fem/gii$Sec_Edu_Mal
gii <- mutate(gii, Lab_Forc_R = (Lab_Forc_Particip_R_Fem/Lab_Forc_Particip_R_Mal))
head(gii)
## GII_Rank Country GII Mortality_R Adolescent_Birth_R
## 1 1 Norway 0.067 4 7.8
## 2 2 Australia 0.110 6 12.1
## 3 3 Switzerland 0.028 6 1.9
## 4 4 Denmark 0.048 5 5.1
## 5 5 Netherlands 0.062 6 6.2
## 6 6 Germany 0.041 7 3.8
## Representation_Parliament Sec_Edu_Fem Sec_Edu_Mal Lab_Forc_Particip_R_Fem
## 1 39.6 97.4 96.7 61.2
## 2 30.5 94.3 94.6 58.8
## 3 28.5 95.0 96.6 61.8
## 4 38.0 95.5 96.6 58.7
## 5 36.9 87.7 90.5 58.5
## 6 36.9 96.3 97.0 53.6
## Lab_Forc_Particip_R_Mal edu_R Lab_Forc_R
## 1 68.7 1.0072389 0.8908297
## 2 71.8 0.9968288 0.8189415
## 3 74.9 0.9834369 0.8251001
## 4 66.4 0.9886128 0.8840361
## 5 70.6 0.9690608 0.8286119
## 6 66.4 0.9927835 0.8072289
Task 6: Join the two datasets together using the variable Country as the identifier
TIP: we use an inner join to keep only the observations present in both datasets
hd_gii_joint <- inner_join(hd, gii,
by = 'Country', suffix = c("_hd", "_gi"))
'head(hd_gii_joint)' # I comment this line out to shorten the report
## [1] "head(hd_gii_joint)"
dim(hd_gii_joint)
## [1] 195 19
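To see the inner-join semantics on a toy example (the country names and column names here are made up for illustration): base R's merge() is an inner join by default, so only keys present in both tables survive.

```r
# two toy tables sharing the key column 'Country'
x <- data.frame(Country = c("Norway", "Atlantis"), hd_var  = c(1, 2))
y <- data.frame(Country = c("Norway", "Mordor"),   gii_var = c(3, 4))

# inner join: only the shared key "Norway" survives
merge(x, y, by = "Country")
```

In our case both hd and gii list the same 195 countries, so the joined data keeps all 195 rows.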
str(hd_gii_joint)
## 'data.frame': 195 obs. of 19 variables:
## $ HDI_Rank : int 1 2 3 4 5 6 6 8 9 9 ...
## $ Country : chr "Norway" "Australia" "Switzerland" "Denmark" ...
## $ HDI : num 0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
## $ Life_Expec_Birth : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Expec_yr_Edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Mean_yr_Edu : num 12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
## $ GNI_per_Cap : chr "64,992" "42,261" "56,431" "44,025" ...
## $ GNI_per_Cap_Min_HDI_Rank : int 5 17 6 11 9 11 16 3 11 23 ...
## $ GII_Rank : int 1 2 3 4 5 6 6 8 9 9 ...
## $ GII : num 0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
## $ Mortality_R : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Adolescent_Birth_R : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Representation_Parliament: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ Sec_Edu_Fem : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Sec_Edu_Mal : num 96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
## $ Lab_Forc_Particip_R_Fem : num 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
## $ Lab_Forc_Particip_R_Mal : num 68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
## $ edu_R : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Lab_Forc_R : num 0.891 0.819 0.825 0.884 0.829 ...
#setwd("./data")
write.csv(hd_gii_joint, "./data/human.csv")
Now this week, let's continue with data wrangling and analysis on the theme of dimensionality reduction.
human <- read.csv("./data/human.csv")
#alternatively, you can read directly from the web:
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human1.txt", sep =",", header = T)
# check column names of the data
names(human)
## [1] "HDI.Rank" "Country" "HDI" "Life.Exp"
## [5] "Edu.Exp" "Edu.Mean" "GNI" "GNI.Minus.Rank"
## [9] "GII.Rank" "GII" "Mat.Mor" "Ado.Birth"
## [13] "Parli.F" "Edu2.F" "Edu2.M" "Labo.F"
## [17] "Labo.M" "Edu2.FM" "Labo.FM"
#dimensions of the dataset
dim(human)
## [1] 195 19
Data description:
This data originates from the United Nations Development Programme. It gives us information about Human Development status in different countries by different variables.
The data combines several indicators from most countries in the world. More info in http://hdr.undp.org/en/content/human-development-index-hdi
The variable names and their descriptions are as follows: “Country” = Country name
Health and knowledge
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate
Empowerment
“Parli.F” = Percentage of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” = Proportion of males in the labour force
“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo.F / Labo.M
NOTE: Since I personally do not like dots (.) in file or column names, I change them to underscores (_)
colnames(human) <- c("HDI_Rank", "Country", "HDI", "Life_Exp",
"Edu_Exp", "Edu_Mean", "GNI", "GNI_Minus_Rank",
"GII_Rank","GII","Mat_Mor","Ado_Birth",
"Parli_F","Edu2_F","Edu2_M","Labo_F",
"Labo_M","Edu2_FM","Labo_FM")
#Source: https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/datasets/human_meta.txt
# Check Column names again that changed successfully
colnames(human)
## [1] "HDI_Rank" "Country" "HDI" "Life_Exp"
## [5] "Edu_Exp" "Edu_Mean" "GNI" "GNI_Minus_Rank"
## [9] "GII_Rank" "GII" "Mat_Mor" "Ado_Birth"
## [13] "Parli_F" "Edu2_F" "Edu2_M" "Labo_F"
## [17] "Labo_M" "Edu2_FM" "Labo_FM"
# look at the structure of human
'str(human)' # I comment this line out to shorten the report
## [1] "str(human)"
# print out summaries of the variables
summary(human)
## HDI_Rank Country HDI Life_Exp
## Min. : 1.00 Length:195 Min. :0.3480 Min. :49.00
## 1st Qu.: 47.75 Class :character 1st Qu.:0.5770 1st Qu.:65.75
## Median : 94.00 Mode :character Median :0.7210 Median :73.10
## Mean : 94.31 Mean :0.6918 Mean :71.07
## 3rd Qu.:141.25 3rd Qu.:0.8000 3rd Qu.:76.80
## Max. :188.00 Max. :0.9440 Max. :84.00
## NA's :7
## Edu_Exp Edu_Mean GNI GNI_Minus_Rank
## Min. : 4.10 Min. : 1.400 Length:195 Min. :-84.0000
## 1st Qu.:11.10 1st Qu.: 5.550 Class :character 1st Qu.: -9.0000
## Median :13.10 Median : 8.400 Mode :character Median : 1.5000
## Mean :12.86 Mean : 8.079 Mean : 0.1862
## 3rd Qu.:14.90 3rd Qu.:10.600 3rd Qu.: 11.0000
## Max. :20.20 Max. :13.100 Max. : 47.0000
## NA's :7
## GII_Rank GII Mat_Mor Ado_Birth
## Min. : 1.00 Min. :0.0160 Min. : 1.0 Min. : 0.60
## 1st Qu.: 47.75 1st Qu.:0.2030 1st Qu.: 16.0 1st Qu.: 15.45
## Median : 94.00 Median :0.3935 Median : 69.0 Median : 40.95
## Mean : 94.31 Mean :0.3695 Mean : 163.2 Mean : 49.55
## 3rd Qu.:141.25 3rd Qu.:0.5272 3rd Qu.: 230.0 3rd Qu.: 71.78
## Max. :188.00 Max. :0.7440 Max. :1100.0 Max. :204.80
## NA's :7 NA's :33 NA's :10 NA's :5
## Parli_F Edu2_F Edu2_M Labo_F
## Min. : 0.00 Min. : 0.9 Min. : 3.20 Min. :13.50
## 1st Qu.:12.47 1st Qu.: 27.8 1st Qu.: 38.30 1st Qu.:44.50
## Median :19.50 Median : 55.7 Median : 60.00 Median :53.30
## Mean :20.60 Mean : 54.8 Mean : 60.29 Mean :52.61
## 3rd Qu.:27.02 3rd Qu.: 81.8 3rd Qu.: 85.80 3rd Qu.:62.62
## Max. :57.50 Max. :100.0 Max. :100.00 Max. :88.10
## NA's :3 NA's :26 NA's :26 NA's :11
## Labo_M Edu2_FM Labo_FM
## Min. :44.20 Min. :0.1717 Min. :0.1857
## 1st Qu.:68.88 1st Qu.:0.7284 1st Qu.:0.5987
## Median :75.55 Median :0.9349 Median :0.7514
## Mean :74.74 Mean :0.8541 Mean :0.7038
## 3rd Qu.:80.15 3rd Qu.:0.9968 3rd Qu.:0.8513
## Max. :95.50 Max. :1.4967 Max. :1.0380
## NA's :11 NA's :26 NA's :11
The summary shows the minimum, maximum, mean, median, and 1st and 3rd quartiles of the data. It also shows whether there are NA values in the columns. Moreover, we can see that all variables except Country and GNI are numerical.
Task 1: Mutate the data: transform the Gross National Income (GNI) variable to numeric
# access the stringr package
library(stringr)
# look at the structure of the GNI column in 'human'
str(human$GNI)
## chr [1:195] "64,992" "42,261" "56,431" "44,025" "45,435" "43,919" "39,568" ...
class(human$GNI)
## [1] "character"
As you can see, the GNI column is not numerical (it is of character class), so we shall convert it to numeric.
# remove the commas from GNI and print out a numeric version of it
human$GNI <- str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric
str(human$GNI)
## num [1:195] 64992 42261 56431 44025 45435 ...
class(human$GNI)
## [1] "numeric"
Task 2: keep only the columns matching the following variable names
# columns to keep
keep <- c("Country", "Edu2_FM", "Labo_FM", "Life_Exp", "Edu_Exp",
"GNI", "Mat_Mor", "Ado_Birth", "Parli_F")
# select the 'keep' columns
human <- dplyr::select(human, one_of(keep))
length(keep)
## [1] 9
ncol(human)
## [1] 9
So the number of columns of dataset is now equal to the number of variables we like to keep.
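As a side note, in newer dplyr versions one_of() is superseded by all_of()/any_of(); the same selection can also be done in base R, sketched here on a toy data frame (the column values are made up):

```r
# toy data frame with one extra column we do not want
toy  <- data.frame(Country = "Finland", GNI = 38941, Extra = 1)
keep <- c("Country", "GNI")

# base R equivalent of the dplyr selection above
toy[, keep, drop = FALSE]
```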
Task 3 : Remove all rows with missing values
# print out a completeness indicator of the 'human' data
complete.cases(human)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [13] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
## [61] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [97] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [109] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
## [145] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## [157] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [169] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [181] TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [193] TRUE TRUE TRUE
The complete.cases function returns a logical vector indicating which cases are complete, i.e., have no missing values.
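A minimal sketch on toy data (not the course dataset) of how complete.cases() flags rows:

```r
# toy data frame whose third row contains a missing value
toy <- data.frame(a = c(1, 2, NA), b = c("x", "y", "z"))

complete.cases(toy)          # TRUE TRUE FALSE: the NA row is flagged
toy[complete.cases(toy), ]   # keeps only the fully observed rows
```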
# filter out all rows with NA values
human_filterd <- filter(human, complete.cases(human))
dim(human_filterd)
## [1] 162 9
dim(human)
## [1] 195 9
So you can see that the original human dataset had 195 rows, and the filtered version now has 162.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.4 v purrr 0.3.4
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x plotly::select() masks MASS::select(), dplyr::select()
human_filterd_new = human %>% drop_na()
dim(human_filterd_new)
## [1] 162 9
#Or alternatively same
human_ = human %>% drop_na(colnames(human))
dim(human_)
## [1] 162 9
Task 4: Remove the observations which relate to regions instead of countries
#look at the last 10 observations
tail(human_filterd, 10)
## Country Edu2_FM Labo_FM Life_Exp Edu_Exp GNI
## 153 Chad 0.1717172 0.8080808 51.6 7.4 2085
## 154 Central African Republic 0.3782772 0.8531140 50.7 7.2 581
## 155 Niger 0.3076923 0.4459309 61.4 5.4 908
## 156 Arab States 0.7289916 0.3081009 70.6 12.0 15722
## 157 East Asia and the Pacific 0.8250377 0.7884131 74.0 12.7 11449
## 158 Europe and Central Asia 0.8784119 0.6514286 72.3 13.6 12791
## 159 Latin America and the Caribbean 0.9836957 0.6729323 75.0 14.0 14242
## 160 South Asia 0.5329670 0.3711083 68.4 11.2 5605
## 161 Sub-Saharan Africa 0.7015873 0.8537859 58.5 9.6 3363
## 162 World 0.8333333 0.6558018 71.5 12.2 14301
## Mat_Mor Ado_Birth Parli_F
## 153 980 152.0 14.9
## 154 880 98.3 12.5
## 155 630 204.8 13.3
## 156 155 45.4 14.0
## 157 72 21.2 18.7
## 158 28 30.8 19.0
## 159 85 68.3 27.0
## 160 183 38.7 17.5
## 161 506 109.7 22.5
## 162 210 47.4 21.8
# last index we want to keep
last <- nrow(human_filterd) - 7
# choose everything until the last 7 observations
human_ <- human_filterd[1:last, ]
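The positional slice above relies on the region aggregates happening to sit in the last 7 rows. A name-based alternative (with a hand-written region list, so the names are an assumption you must maintain yourself) is more robust; a minimal toy sketch:

```r
# toy data mixing real countries with aggregate rows
toy <- data.frame(Country = c("Chad", "Niger", "South Asia", "World"),
                  GNI     = c(2085, 908, 5605, 14301))

# hand-picked aggregate names to exclude (hypothetical, incomplete list)
regions <- c("South Asia", "World")

# keep only the rows that are real countries
toy[!(toy$Country %in% regions), ]
```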
Task 5. Define the row names of the data by the country names
# add countries as rownames
rownames(human_) <- human_$Country
# remove the Country variable
human_ <- select(human_, -Country)
# load library
library(GGally)
library(corrplot)
# visualize the 'human_' variables
ggpairs(human_)
As the graph shows, there are strong correlations between some variables, such as Ado_Birth with Mat_Mor and Edu_Exp; similarly, GNI with Life_Exp and Edu_Exp.
# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot(type = "lower")
As the graph shows, there is a strong negative correlation between Mat_Mor and Life_Exp, which means that as Mat_Mor increases, Life_Exp decreases. The same holds for Life_Exp and Ado_Birth.
There is also a positive correlation between Ado_Birth and Mat_Mor. Parli_F and Labo_FM do not correlate strongly with the other variables.
TIP: by including `type = "lower"` you plot only the lower triangular part of the correlation matrix.
write.csv(human_, "human_new.csv")
human <- read.csv("human_new.csv")
# Alternatively, you can read directly from the web
human <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep = ",", stringsAsFactors = F)
dim(human)
## [1] 155 8
Task 1: Show a graphical overview
# standardize the variables
human_std <- scale(human)
With center = TRUE (the default), scaling starts by centering: the column means of x (omitting NAs) are subtracted from the corresponding columns.
We standardize the data because PCA is sensitive to the relative scaling of the original features.
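A minimal sketch (toy vector, not the course data) showing that scale() with its defaults is equivalent to subtracting the mean and dividing by the standard deviation:

```r
x <- c(2, 4, 6, 8)

# manual standardization: center by the mean, divide by the sd
manual <- (x - mean(x)) / sd(x)

# scale() with center = TRUE, scale = TRUE (the defaults)
auto <- as.vector(scale(x))

all.equal(manual, auto)   # the two results agree
```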
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
As you can see from the summaries of the scaled and unscaled data, the scaled version is standardized by the mean and standard deviation of each column, so the variables are on comparable scales. Now we are ready for further analysis.
ggpairs(as.data.frame(human_std))
This visualises the distribution of each column and its relationship with the other columns.
cor(as.data.frame(human_std)) %>% corrplot(type = "lower")
The correlation matrix (same as above) is given here, with the same interpretation as in the data wrangling part.
Task 2 Perform principal component analysis (PCA) with the SVD method
As instructed, we do PCA on the non-standardized human data.
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
The arrows show the original variables (they overlap each other).
Task 3: Perform PCA with standardized variables
pca_human <- prcomp(human_std)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
The arrows show the original variables
# rounded percentages of variance captured by each PC
pca_pr <- round(100*summary(pca_human)$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Interpretation: In the biplot, a small angle between arrows means a high correlation between the two variables. The angles between Parli_F and Edu.Exp or Life.Exp are wide, so these variables do not correlate strongly (as also shown in the correlation graphs above).
On the other hand, the angles between Edu.Exp, GNI and Edu2.FM are very narrow, indicating very high correlations (again, as in the correlation plots above).
Additionally, we can see that principal component 2 (PC2), on the y-axis, points in the same direction as Parli_F and Labo_FM, so these variables contribute mainly to PC2, while all the other variables contribute mainly to PC1. More info at https://vimeo.com/204165420
As for the PCA components printed by the summary function, the results differ from the unstandardized case, which is expected: since the data were standardized, the standard deviations and proportions of variance differ from those of the non-standardized dataset.
We shall remember the PCA is sensitive to scaling of variables.
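A minimal sketch of that sensitivity on toy data (the variable names are made up): a variable with a much larger scale dominates PC1 unless the data are standardized first.

```r
set.seed(1)
d <- data.frame(small = rnorm(50),           # unit-scale variable
                big   = rnorm(50) * 1000)    # same shape, much larger scale

p_raw <- prcomp(d)                  # no scaling: 'big' dominates PC1
p_std <- prcomp(d, scale. = TRUE)   # standardized: variance shared more evenly

summary(p_raw)$importance[2, 1]     # proportion of variance of PC1, close to 1
summary(p_std)$importance[2, 1]     # much closer to 0.5 after standardizing
```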
To interpret the PCA, we can look at the proportion of variance and its cumulative proportion. For example, in the standardized data, PC1 accounts for 53.6% of the variance and PC2 for 16.2%. Compared with the non-standardized result, this has changed as expected, because the variables were on very different scales, which PCA is sensitive to. In the non-standardized case PC1 alone accounted for over 99% of the variance, which looks doubtful to me. Thus, standardising the data before PCA is very effective and helpful for reducing the dimensionality of the data!
It makes sense, because in the real world Gross National Income per capita (GNI) tends to be higher where expected years of schooling (Edu_Exp) are higher, and life expectancy (Life_Exp) is then also higher.
Task 4: I personally find PCA very useful, also in my own research in forest data science, where I often have tens of variables. PC1 and PC2 together account for nearly 70% (53.6 + 16.2) of the variance in the original data.
Inspired by https://cran.r-project.org/web/packages/pca3d/vignettes/pca3d.pdf
library(pca3d)
gr<-factor(rownames(human_std))
#summary(gr)
pca3d(pca_human, group=gr)
## [1] 0.12143416 0.06251040 0.05134851
snapshotPCA3d(file="first_plot.png")
Feel free to zoom in on it or rotate it.
Every point is an observation (country) in the 3D space spanned by PC1, PC2 and PC3.
Task 5: Multiple Correspondence Analysis
Let’s do Multiple Correspondence Analysis (MCA) using the FactoMineR package, which contains functions dedicated to multivariate exploratory data analysis. It includes, for example, (Multiple) Correspondence Analysis, Multiple Factor Analysis, and PCA.
We will do only Multiple Correspondence Analysis (MCA) with this data.
In this part I am going to use the tea dataset. The dataset contains the answers of a questionnaire on tea consumption.
library(FactoMineR)
library(ggplot2)
data(tea)
View(tea)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
As you can see, the data include information on tea consumption, e.g. 74 drinkers of black tea, 193 of Earl Grey and 33 of green tea, how people drink their tea, etc. We are going to plot them and inspect them visually (as follows).
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
So you can see the data include information on tea consumption: what type of tea is drunk, how it is prepared, where it is bought and when it is drunk.
We use these columns for the MCA analysis.
mca <- MCA(tea_time, graph = FALSE)
#summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
Interpretation: MCA is another approach to dimensionality reduction. MCA works well with categorical variables, as PCA does for numerical variables. As the summary shows, MCA gives us, for example, the eigenvalues, which include the variance, the percentage of variance and the cumulative percentage of variance for each dimension. Dimension 1 accounts for 15.2% of the variance in the data, dimension 2 for about 14% and dimension 3 for nearly 12%. The share gradually declines, as plotted in the extra coding parts below.
Additionally, the next part of the summary shows the individuals (the 10 first here) for each dimension: the coordinate, the contribution to each dimension (ctr), and the squared correlation with each dimension (cos2). For example, individual 1 has coordinate -0.298 on Dim.1, a contribution of 0.106 and a squared correlation of 0.086. Note that a dimension here corresponds to a principal component (PC) in PCA; in both approaches (MCA and PCA) we transform the original data.
The next part, shows Categories (the 10 first), has v-test values (like normal t-test) in addition to the same stuff explained above. v.test value for black tea is 4.677 which means the coordinates is significantly different from zero. all values below or above +/-1.96 means that.
Final part of the summary shows Categorical variables (eta2) which shows squared correlation between each variable and dimension. The closer to 1, the stronger the correlation is. For example, how category and Dim.1 has strong correlation (0.708), which lunch and Dim.1 has no correlation (0.0). For more info please check https://vimeo.com/204251158
Visualize MCA by biplot
plot(mca, invisible=c("ind"), habillage = "quali")
Interpretation: the plot shows the MCA factor map, made by plotting Dim 1 and Dim 2 on the x and y axes, respectively. It makes it easy to see possible patterns among the variables: the shorter the distance between categories, the more similar they are, e.g. No sugar and No lunch in the middle of the graph, which makes good sense :). In contrast, tea bag and unpackaged are far from each other, which shows their dissimilarity.
Inspired by http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/
library("factoextra")
# Color by cos2 values: quality on the factor map
fviz_mca_var(mca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal())
This is the same factor map as above; the new thing is that the categories are now coloured by their squared correlation (cos2) with the dimensions.
Or yet another way of plotting
factoextra::fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))
Interpretation: as the dimension number increases, the percentage of explained variance declines. The first dimensions capture most of the variance in the data, so, for example, the 10th dimension has little left to contribute because most of the variance is already captured by the earlier dimensions.
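The percentages in the scree plot come straight from the eigenvalue table stored in the MCA result. A minimal sketch, assuming the FactoMineR tea data and the same six columns used in this chapter (column names here are my assumption of the earlier variable selection):

```r
# Sketch: rebuild the MCA and inspect its eigenvalue table directly
library(FactoMineR)
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)
head(mca$eig)         # eigenvalue, % of variance, cumulative % per dimension
sum(mca$eig[, 2])     # the percentages of variance sum to 100
```

The scree plot is simply a bar chart of the second column of this table.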
Mohammad Imangholiloo
My GitHub repository is https://github.com/Imangholiloo/IODS-project
Longitudinal data are data in which multiple observations or measurements of the same individuals exist; consequently, the observations themselves can be correlated. So far we have learned about correlated variables, but it can also happen that the observations themselves correlate, because they were measured on the same individual at different times or under different treatments. For example, this happens when an observation is measured twice or several times (repeatedly); such data are called longitudinal data.
In this practical we will analyse such data and learn how to handle this effect. Converting the data between the wide form and the long form is important to learn, because these data typically arrive in the wide form, while most analyses require the long form.
Note that the main objective in the analysis of longitudinal data is to detect change in the repeated values of the response variable and to determine the explanatory variables most associated with any change.
So, stay tuned!
We will prepare and analyse two data sets, BPRS and RATS.
Task 1 Read and understand the data
Data description: the BPRS data comprise 40 male subjects who were randomly assigned to one of two treatment groups; each subject was rated on the brief psychiatric rating scale (BPRS), measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations, and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
NOTE: The swapped data are in the latter half of this report, because I proactively did both chapters and both methods and am happy to present all the results. So please remember that the swapped data are in the latter part of this report.
#Read the BPRS data
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
# Look at the (column) names of BPRS
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2" "week3"
## [7] "week4" "week5" "week6" "week7" "week8"
# Look at the structure of BPRS
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
#check dimensions of the data
dim(BPRS)
## [1] 40 11
# print out summaries of the variables
summary(BPRS)
## treatment subject week0 week1 week2
## Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00 Min. :26.0
## 1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## Median :1.5 Median :10.50 Median :46.00 Median :41.00 Median :38.0
## Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33 Mean :41.7
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00 Max. :75.0
## week3 week4 week5 week6 week7
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00 Min. :18.0
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75 1st Qu.:23.0
## Median :36.50 Median :34.50 Median :30.50 Median :28.50 Median :30.0
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23 Mean :32.2
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00 3rd Qu.:38.0
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00 Max. :62.0
## week8
## Min. :20.00
## 1st Qu.:22.75
## Median :28.00
## Mean :31.43
## 3rd Qu.:35.25
## Max. :75.00
# see first 10 rows of data
head(BPRS)
## treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1 1 1 42 36 36 43 41 40 38 47 51
## 2 1 2 58 68 61 55 43 34 28 28 28
## 3 1 3 54 55 41 38 43 28 29 25 24
## 4 1 4 55 77 49 54 56 50 47 42 46
## 5 1 5 72 75 72 65 50 39 32 38 32
## 6 1 6 48 43 41 38 36 29 33 27 25
We can see subjects, assigned to treatments, that were measured repeatedly across the weeks (weeks 0-8). Such data are labelled repeated measures: it is nowadays common to measure the response variable under a number of different experimental conditions or on a number of different occasions over time, and such data are called repeated measures or longitudinal data. For example, subject 1 under treatment 1 produced different brief psychiatric rating scale (BPRS) values in weeks 0 to 8.
Task 2
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
Now we have set the treatment and subject columns as factors, so they are handled as categorical rather than continuous variables.
Task 3: Convert the data sets to long form. Add a week variable to BPRS and a Time variable to RATS.
# Convert to long form
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
dim(BPRSL)
## [1] 360 4
dim(BPRS)
## [1] 40 11
As you can see, we used the gather function to gather the week columns into key-value pairs: the key column weeks now contains the week labels, and bprs contains the measured values. So now we have converted the data to long form.
The weeks are in a slightly inconvenient form as characters, so we need to extract the week numbers from the character vector weeks.
# Extract the week number
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
dim(BPRSL)
## [1] 360 5
So now we have a new integer column with the week number.
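Note that substr(weeks, 5, 5) grabs only a single character, which works here because the week numbers are 0-8. A more general sketch (toy vector, my own example) that would also handle multi-digit numbers:

```r
# Strip everything that is not a digit, then convert to integer
weeks <- c("week0", "week8", "week12")
as.integer(gsub("[^0-9]", "", weeks))  # 0 8 12
```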
# Take a glimpse at the BPRSL data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
We can see that we now have 360 rows and 5 columns, unlike the wide form of the data. In the transformation from wide to long form, we restructured the week columns into a single column, with the bprs values likewise in one column. We also added a new column called 'week' with the integer week number, because the column 'weeks' contains character values such as week0, which would be awkward to handle in the numerical analysis of the next phase.
Before wrangling the RATS dataset (below), let's draw some descriptive graphics of this data.
Now let's make some nice descriptive plots.
To begin, we plot the BPRS values for all 40 men, differentiating between the treatment groups to which the men were randomized. This simple graph makes a number of features of the data readily apparent.
#Load the library to R
library(ggplot2)
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "right") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Each line is one subject, i.e. one of the men who participated in the experiment; there were 20 men per treatment group, hence 20 lines in each panel.
We can see how the bprs values changed (generally declined) over the weeks for each person, in each treatment.
#let's write the wrangled data to files in drive
write.csv(BPRSL, "BPRSL.csv")
Task 1: Implement the analyses of Chapter 8 of MABS using the RATS data.
As mentioned in the textbook of the course, the graphical Displays of Longitudinal Data is to:
"• Show as much of the relevant raw data as possible rather than only data summaries
• Highlight aggregate patterns of potential scientific interest
• Identify both cross-sectional and longitudinal patterns
• Try to make the identification of unusual individuals or unusual observations simple.
Standardise the variable bprs
Standardising the values of a phenomenon can lead to better and clearer tracking and viewing of the data. We do this by subtracting the relevant occasion mean from each original observation and then dividing by the corresponding occasion standard deviation.
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
ungroup()
# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.6983632, 0...
# Plot again with the standardised bprs
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(name = "standardized bprs")
This graph shows that there are considerable individual differences and variability, which appear to decrease with time.
# Number of weeks, baseline (week 0) included
n <- BPRSL$week %>% unique() %>% length()
So there were 9 weeks (the baseline plus eight treatment weeks), i.e. the duration of the study.
# Summary data with mean and standard error of bprs by treatment and week
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise( mean = mean(bprs), se = sd(bprs)/sqrt(n) ) %>%
ungroup()
## `summarise()` regrouping output by 'treatment' (override with `.groups` argument)
# Glimpse the data
glimpse(BPRSS)
## Rows: 18
## Columns: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ week <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8
## $ mean <dbl> 47.00, 46.80, 43.55, 40.90, 36.60, 32.70, 29.70, 29.80, 2...
## $ se <dbl> 4.534468, 5.173708, 4.003617, 3.744626, 3.259534, 2.59576...
# Plot the mean profiles
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
geom_line() +
scale_linetype_manual(values = c(1,2)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(bprs) +/- se(bprs)")
Now you can see the mean BPRS value, with the standard error shown by vertical bars, for each week.
We can see that treatments 1 and 2 started at about 47 and 49 bprs, respectively, and declined as time went on. The decrease for treatment 1 was larger than for treatment 2: treatment 2 did not go lower than about 36, while treatment 1 levelled off below 30.
Find and discard outliers
Create summary data by treatment and subject, with the mean as the summary variable (ignoring the baseline, week 0).
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
## `summarise()` regrouping output by 'treatment' (override with `.groups` argument)
# Glimpse the data
glimpse(BPRSL8S)
## Rows: 40
## Columns: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.125, 3...
# Draw a boxplot of the mean versus treatment
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
## Warning: `fun.y` is deprecated. Use `fun` instead.
This graph shows the mean over weeks 1 to 8 for each treatment (1 and 2). We can see that the mean of treatment 1 is higher than that of treatment 2.
From the boxplots, we can also see that treatment 2 has an outlier with a value over 70.
So let's filter it out.
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
BPRSL8S1 <- BPRSL8S %>%
filter(mean < 60)
Here we used the 'filter' function, which subsets rows using column values; in this case mean < 60, so only values below 60 are kept.
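As an alternative to a hand-picked cutoff of 60, the 1.5 × IQR rule that boxplots use to flag outliers could be applied. A sketch with a toy vector (in the report this would be BPRSL8S$mean):

```r
# Toy subject means with one clear outlier (hypothetical values)
m <- c(34, 36, 38, 41, 43, 70)

# Upper whisker limit: Q3 + 1.5 * IQR
q <- quantile(m, c(0.25, 0.75))
upper <- unname(q[2] + 1.5 * diff(q))

m[m <= upper]  # drops the 70
```

The advantage is that the cutoff adapts to the data instead of being fixed by eye.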
A t-test to check group differences
Although the graphs all indicated a lack of difference between the two treatment groups, we had better check with a statistical test, such as the t-test, whether a difference between the treatments exists.
We will use the data without the outlier. So, let's proceed…
# Perform a two-sample t-test
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by treatment
## t = 0.52095, df = 37, p-value = 0.6055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.232480 7.162085
## sample estimates:
## mean in group 1 mean in group 2
## 36.16875 34.70395
Interpretation: the p-value is 0.6055, which is well above 0.05, so we cannot reject the null hypothesis that there is no difference between the treatments. This confirms the lack of any evidence for a group difference at the 95% confidence level; accordingly, the 95% confidence interval for the difference in means (-4.23 to 7.16) also includes zero.
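As a side note, with var.equal = TRUE we ran the classic pooled-variance t-test; t.test() actually defaults to Welch's test, which does not assume equal group variances and is often the safer choice. A sketch using the same outlier-free summary data as above:

```r
# Welch's two-sample t-test (the t.test default, var.equal = FALSE),
# assuming BPRSL8S1 from the chunk above is available
t.test(mean ~ treatment, data = BPRSL8S1)
```

With these data it leads to the same conclusion of no group difference.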
Linear regression and ANOVA analysis
Now let's add the baseline to our calculations because, as the book says, baseline measurements of the outcome variable in a longitudinal study are often correlated with the chosen summary measure, and using them appropriately as a covariate in an analysis of covariance can lead to substantial gains in precision.
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
mutate(baseline = BPRS$week0)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment 1 3.45 3.45 0.0557 0.8148
## Residuals 37 2292.97 61.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table shows that the baseline is strongly associated with the BPRS values taken after treatment began, but that the treatment effect is not statistically significant (Pr = 0.8148, well above the 0.05 significance level) even after conditioning on the baseline value. This is the same conclusion as from the t-test above.
Analysing the RATS data
Now, let's do the same for the RATS data.
# read the RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
Data description: this dataset comes from a nutrition study carried out on three groups of rats. The groups were put on different diets, and each animal's body weight (grams) was recorded repeatedly, approximately weekly (except in week seven, when two recordings were taken), over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
So, let’s study the question…
library(GGally)
ggpairs(RATS, lower = list(combo = wrap("facethist", bins = 20)))
We can see that the variables are all highly correlated with each other (above 0.9), which we need to keep in mind. Stay tuned, we will return to this in a bit.
# Factor variables ID and Group
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
# Glimpse the data
glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4...
## $ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 4...
## $ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 4...
## $ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 4...
## $ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 4...
## $ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 4...
## $ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 4...
## $ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 5...
## $ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 5...
## $ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 5...
## $ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 5...
Linear Mixed Effects Models:
# Convert data to long form
## Converting the long form is explained in above for BPRS data, so I don't repeat the explanation now
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>% ###Convert data to long form
mutate(Time = as.integer(substr(WD,3,4))) ### Extract the week number
dim(RATSL)
## [1] 176 5
dim(RATS)
## [1] 16 13
As you can see, we again used the gather function to gather the measurement-day columns into key-value pairs: the key column WD contains the day labels and Weight contains the measured values. We also extracted the day number into a new integer column, Time. So now we have converted the data to long form.
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
You can see that we have 176 rows and 5 columns.
#------------------
"Plot first, ask questions later"
## [1] "Plot first, ask questions later"
# Check the dimensions of the data
dim(RATSL)
## [1] 176 5
# Let's draw a plot
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme(legend.position = "top")
As the plot shows, the weights of the rats generally increased over the course of the experiment. The weights of the group 1 rats were distinguishably lower than those of groups 2 and 3, and the weight increase in groups 2 and 3 seems to be sharper than in group 1.
Additionally, it seems that the repeated measures are dependent on one another.
#let's write the wrangled data to files in drive
write.csv(RATSL, "RATSL.csv")
Linear model
Let's fit a multiple linear regression model while ignoring the repeated-measures structure of the data, i.e. assuming the observations are independent.
# create a regression model RATS_reg
RATS_reg <- lm(Weight ~ Time + Group, data = RATSL)
We fitted a multiple linear regression model with weight as response and Time and Group as explanatory variables.
# print out a summary of the model
summary(RATS_reg)
##
## Call:
## lm(formula = Weight ~ Time + Group, data = RATSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.643 -24.017 0.697 10.837 125.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.0689 5.7725 42.281 < 2e-16 ***
## Time 0.5857 0.1331 4.402 1.88e-05 ***
## Group2 220.9886 6.3402 34.855 < 2e-16 ***
## Group3 262.0795 6.3402 41.336 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.34 on 172 degrees of freedom
## Multiple R-squared: 0.9283, Adjusted R-squared: 0.9271
## F-statistic: 742.6 on 3 and 172 DF, p-value: < 2.2e-16
Interpretation: because the p-values of all terms (Time, Group2 and Group3) are far below 0.05, they are all highly statistically significant, i.e. clearly different from zero.
There is also high variation in the intercepts of the regression fits of the individual rat growth profiles.
Notably, because we ignored the probable within-subject dependence, the standard error of a within-subject covariate such as Time is larger than it should be. So we should next create a random intercept model to take this into account.
Random Intercept Model
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
# Create a random intercept model
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
Note the random-effects terms distinguished by vertical bars (|).
# Print the summary of the model
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1333.2 1352.2 -660.6 1321.2 170
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5386 -0.5581 -0.0494 0.5693 3.0990
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1085.92 32.953
## Residual 66.44 8.151
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 244.06890 11.73107 20.80
## Time 0.58568 0.03158 18.54
## Group2 220.98864 20.23577 10.92
## Group3 262.07955 20.23577 12.95
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.090
## Group2 -0.575 0.000
## Group3 -0.575 0.000 0.333
As we can see from the summary, the estimated standard error of Time (0.03158) is much smaller than it was in the regression model (0.1331), reflecting the probable within-subject dependence that the random intercept model takes into account.
The standard errors of the group effects (Group2 and Group3) are now at least three times larger than in the previous regression model, and the same holds for the intercept of the model.
More importantly, we can see that the estimated Group2 effect (220.98864, t = 10.92) is smaller than the intercept (244.06890), while the Group3 effect (262.07955, t = 12.95) is larger. Using the rule of thumb of 2 for the t-value (introduced in this link: https://web.stanford.edu/class/psych252/section/Mixed_models_tutorial.html), we can say these effects are probably significant.
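As a side note, lmer() deliberately reports no p-values. A rough sketch of what the rule of thumb corresponds to: with this many observations, a normal approximation of the t-values gives approximate two-sided p-values (t-values copied from the summary above):

```r
# Normal approximation: two-sided p ~ 2 * P(Z < -|t|)
t_vals <- c(Time = 18.54, Group2 = 10.92, Group3 = 12.95)
round(2 * pnorm(-abs(t_vals)), 4)  # all effectively zero
```

Any |t| above roughly 2 corresponds to p below about 0.05 under this approximation.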
Inspired by https://web.stanford.edu/class/psych252/section/Mixed_models_tutorial.html
d_bysubj = na.omit(RATSL) %>%
group_by(Group) %>%
summarise(mean_weight_of_Rats = mean(Weight))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(d_bysubj, aes(x=factor(Group), y=mean_weight_of_Rats)) +
geom_point(size=4, aes(colour = factor(Group)))
We can see that the mean weight of rats in group 1 is much lower than in groups 2 and 3 (the maximum).
Random Intercept and Random Slope Model
Now let us fit the random intercept and random slope model to the rat growth data. This allows the linear regression fits for each individual to differ not only in intercept but also in slope, so we can account both for individual differences in the rats' growth profiles and for the effect of time.
# create a random intercept and random slope model
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
# print a summary of the model
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1194.2 1219.6 -589.1 1178.2 168
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2259 -0.4322 0.0555 0.5637 2.8825
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1139.2004 33.7520
## Time 0.1122 0.3349 -0.22
## Residual 19.7479 4.4439
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 246.46821 11.80888 20.871
## Time 0.58568 0.08548 6.852
## Group2 214.55803 20.16986 10.638
## Group3 258.91288 20.16986 12.837
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.166
## Group2 -0.569 0.000
## Group3 -0.569 0.000 0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
The random intercept and random slope model provides a better fit for these data. It shows that the estimated effect for group 3 (258.91288) is considerably higher than for group 2 (214.55803), with the same standard error but a higher t-value for group 3. The bottom of the output shows that Group2 and Group3 correlate at 0.333, while the intercept correlates more strongly (negatively) with both groups (-0.569 each).
# perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## RATS_ref 6 1333.2 1352.2 -660.58 1321.2
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2 142.94 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test comparing the two models shows that the latter, the random intercept and random slope model, fits significantly better, since the p-value is far below 0.05; it is significant even at the 99% level.
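The chi-squared statistic in the table can be reproduced by hand: it is twice the difference in log-likelihoods, compared against a chi-squared distribution with df equal to the difference in parameter counts (values copied from the anova() output above):

```r
# Likelihood-ratio test by hand
chisq <- 2 * (-589.11 - (-660.58))     # 2 * (logLik difference) = 142.94
df    <- 8 - 6                         # difference in number of parameters
pchisq(chisq, df, lower.tail = FALSE)  # p-value, essentially zero
```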
Random Intercept and Random Slope Model with interaction
Let's test the random intercept and random slope model with a Time × Group interaction.
# create a random intercept and random slope model
RATS_ref2 <- lmer(Weight ~ Time * Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00504145 (tol = 0.002, component 1)
# print a summary of the model
summary(RATS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time * Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1185.9 1217.6 -582.9 1165.9 166
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2667 -0.4249 0.0726 0.6034 2.7511
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1.106e+03 33.2534
## Time 4.925e-02 0.2219 -0.15
## Residual 1.975e+01 4.4439
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.65165 11.79473 21.336
## Time 0.35964 0.08215 4.378
## Group2 200.66549 20.42907 9.823
## Group3 252.07168 20.42907 12.339
## Time:Group2 0.60584 0.14229 4.258
## Time:Group3 0.29834 0.14229 2.097
##
## Correlation of Fixed Effects:
## (Intr) Time Group2 Group3 Tm:Gr2
## Time -0.160
## Group2 -0.577 0.092
## Group3 -0.577 0.092 0.333
## Time:Group2 0.092 -0.577 -0.160 -0.053
## Time:Group3 0.092 -0.577 -0.053 -0.160 0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00504145 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(RATS_ref2, RATS_ref1)
## Data: RATSL
## Models:
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## RATS_ref2: Weight ~ Time * Group + (Time | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2
## RATS_ref2 10 1185.9 1217.6 -582.93 1165.9 12.361 2 0.00207 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we can see that the latter model, in which Time and Group interact, fits significantly better than the previous model without the interaction (at the 99% level, because the p-value 0.00207 is smaller than 0.01).
# Create a vector of the fitted values
Fitted <- fitted(RATS_ref2)
# Create a new column fitted to RATSL
RATSL <- RATSL %>%
mutate(Fitted)
# draw the plot of original RATSL
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Observed weight (grams)") +
theme(legend.position = "top")
# draw the plot of model fitted RATSL
ggplot(RATSL, aes(x = Time, y = Fitted, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Fitted weight (grams)") +
theme(legend.position = "top")
We can see that the fitted weights, derived from the last random intercept and random slope model, look very similar to the observed values in the earlier graph, so the model is successful. Good job!
Thus, this graph shows that the interaction model fits the observed data well.
# creating and plot the residuals of model
resid <- residuals(RATS_ref2)
qqnorm(resid)
resid.std <- resid/sd(resid)
plot(RATSL$ID, resid.std, ylim=c(-3, 3), ylab="Residuals of predicted weight of each rat")
abline(h=0, col= "blue")
It seems that the model predicted the weights well: the closer the median residual of each rat (the middle line of each box) is to the horizontal line at zero, the better. The blue line marks the ideal case of residuals close to 0.
Since we have the original values and the model-fitted values for RATS, we can calculate the root mean square error (RMSE), a statistical metric that measures how far, on average, our fitted values are from the original values in a regression analysis.
Inspired by https://www.statology.org/how-to-calculate-rmse-in-r/
head(RATSL, 2) # print two rows to see the column names used in the formula below
## ID Group WD Weight Time Fitted
## 1 1 1 WD1 240 1 245.6950
## 2 2 1 WD1 225 1 226.8826
RMSE = sqrt(mean((RATSL$Weight - RATSL$Fitted)^2))
The smaller the RMSE, the better a model is able to fit the data.
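As a small sketch, the RMSE computation above can be wrapped into a reusable helper (the function name `rmse` and the toy numbers are my own, not from the RATS data):

```r
# A minimal RMSE helper: observed and fitted must be equal-length numeric vectors
rmse <- function(observed, fitted) {
  sqrt(mean((observed - fitted)^2))
}

# Toy check with made-up weights
obs <- c(240, 225, 245)
fit <- c(242, 224, 246)
rmse(obs, fit)  # sqrt(mean(c(4, 1, 1))) = sqrt(2), about 1.414
```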
Proportional RMSE
RMSE_percent = 100*RMSE/mean(RATSL$Weight)
So, the relative RMSE is about 1.05%, which confirms that our model predicts the weights of the rats very well. Good job!
We can also calculate the mean squared error (MSE), another common metric for measuring the prediction accuracy of a model.
MSE = mean((RATSL$Weight - RATSL$Fitted)^2)
MSE
## [1] 16.31714
The lower the value for MSE, the more accurately a model is able to predict values.
MSE_percent = 100*MSE/mean(RATSL$Weight)
MSE_percent
## [1] 4.243917
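Note that RMSE and MSE are directly related: RMSE is simply the square root of MSE. A quick sanity check with toy numbers (not the RATS data):

```r
# Made-up observed and fitted values; every error is +/- 1
observed <- c(10, 12, 14, 16)
fitted   <- c(11, 11, 15, 15)

MSE  <- mean((observed - fitted)^2)  # (1 + 1 + 1 + 1) / 4 = 1
RMSE <- sqrt(MSE)                    # sqrt(1) = 1

stopifnot(all.equal(RMSE^2, MSE))
```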
So, the MSE is 16.32 grams², which is about 4.24% of the mean weight of the rats. Note that because MSE is in squared units, this percentage is harder to interpret than the relative RMSE.
Now, we can plot the model fitted mean weight (on the x-axis) against the original observed values (on the y-axis).
plot(RATSL$Fitted, RATSL$Weight, main = "Measured and fitted mean weight of rats",
xlab = "Model fitted weight of rats", ylab = "Measured weight of rats")
abline(a = 0, b= 1, col = "blue")
The blue line in the graph shows the 1:1 diagonal (for example, if x = 500 then y = 500), which represents the ideal case: if the measured weight of a rat is 600 grams, the predicted weight should ideally be 600 grams too. As the graph shows, the predicted weights are very close to the observed values, which was also indicated by the RMSE and MSE figures above. For example, if the weight of a rat was 500 grams, the model fitted a value very close to that number.
Now, let’s swap the datasets: apply the chapter 8 analysis to the RATS data and the chapter 9 analysis to the BPRS values.
Apply analysis from chapter 8 of textbook for the RATS dataset
RATSL <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSL)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9...
NOTE: Since I already have the plots and their interpretations in the section above, I go straight to the analysis part.
Analysis of variance table for the fitted model, without baselines.
Fit the linear model with the mean as the response on the swapped data
fit <- lm(mean ~ Group, data = RATSL)
summary(fit)
##
## Call:
## lm(formula = mean ~ Group, data = RATSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.900 -27.169 2.325 9.069 106.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 265.02 12.98 20.419 2.92e-11 ***
## Group2 222.77 22.48 9.909 2.00e-07 ***
## Group3 262.48 22.48 11.675 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.71 on 13 degrees of freedom
## Multiple R-squared: 0.9316, Adjusted R-squared: 0.9211
## F-statistic: 88.52 on 2 and 13 DF, p-value: 2.679e-08
As the model summary shows, the coefficients for Groups 2 and 3, as well as the intercept of the model, are significant (at the 99% confidence level).
Compute the analysis of variance table for the fitted model with anova
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 238620 119310 88.525 2.679e-08 ***
## Residuals 13 17521 1348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test shows that the ‘Group’ variable is statistically significant at the 99% confidence level; thus, there are significant differences in mean weight between the groups of rats.
Apply analysis from chapter 9 of textbook for the Brief Psychiatric Rating Scale (BPRS) dataset
Regression model
reg_bprs <- lm(bprs ~ week + treatment, data = BPRSL)
summary(reg_bprs)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The summary of the linear model shows that the variable ‘week’ is significant (at the 99% confidence level, because Pr(>|t|) is <2e-16); however, treatment2 is not significant, because its p-value is 0.661.
Random intercept model (model#1)
reg_bprs_Ran_intercept_1 <- lmer(bprs ~ week + treatment + (1 | subject ),
data = BPRSL, REML = FALSE)
summary(reg_bprs_Ran_intercept_1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
As we can see from the summary, the estimated standard error of week (0.2084) is smaller now than in the previous regression model (0.2524), reflecting the within-subject dependence accounted for by this random intercept model. Likewise, the standard error of treatment2 is now 1.076, down from 1.303 previously.
More importantly, we can see that the estimated fixed-effect intercept of BPRS is 46.45, the same as in the previous model. Using a rule of thumb of |t| > 2 (introduced at https://web.stanford.edu/class/psych252/section/Mixed_models_tutorial.html), we can say that week is probably significant while treatment2 is not.
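The |t| > 2 rule of thumb can be sketched directly from the fixed-effect estimates and standard errors printed in the summary above (the numbers below are copied from that output):

```r
# Fixed effects from the random intercept model summary above
estimates <- c(intercept = 46.4539, week = -2.2704, treatment2 = 0.5722)
std_errs  <- c(intercept = 1.9090,  week = 0.2084,  treatment2 = 1.0761)

# t value = estimate / standard error
t_values <- estimates / std_errs

# |t| > 2 suggests the effect is probably significant
abs(t_values) > 2  # intercept and week: TRUE; treatment2: FALSE
```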
# Random intercept and random slope model (model #2)
reg_bprs_Ran_intercept_2 <- lmer(bprs ~ week + treatment + treatment*week + (week | subject),
data = BPRSL, REML = FALSE)
summary(reg_bprs_Ran_intercept_2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + treatment * week + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# Random intercept and random slope model (model #3)
reg_bprs_Ran_intercept_3 <- lmer(bprs ~ week + treatment + treatment*week + (treatment| subject),
data = BPRSL, REML = FALSE)
summary(reg_bprs_Ran_intercept_3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + treatment * week + (treatment | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2581.0 2612.1 -1282.5 2565.0 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2782 -0.6037 -0.0797 0.4437 3.3942
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.53 8.033
## treatment2 189.04 13.749 -0.56
## Residual 53.27 7.298
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.0573 23.275
## week -2.6283 0.2107 -12.475
## treatment2 -2.2911 3.3859 -0.677
## week:treatment2 0.7158 0.2980 2.402
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.410
## treatment2 -0.586 0.249
## wek:trtmnt2 0.290 -0.707 -0.352
From the summaries of the three models we can use the t values: treatment2 by itself is never significant, since its t value stays between -2 and +2; in model 3, however, the week:treatment2 interaction (t = 2.40) exceeds 2. Taking model 3 as an example, we can also see a fairly strong correlation between week and week:treatment2 (-0.707), and the second highest between treatment2 and the intercept (-0.586). So, let's run analysis of variance (ANOVA) tables and check from there.
Let's run an ANOVA test on models 1 and 2
anova(reg_bprs_Ran_intercept_1, reg_bprs_Ran_intercept_2)
## Data: BPRSL
## Models:
## reg_bprs_Ran_intercept_1: bprs ~ week + treatment + (1 | subject)
## reg_bprs_Ran_intercept_2: bprs ~ week + treatment + treatment * week + (week | subject)
## npar AIC BIC logLik deviance Chisq Df
## reg_bprs_Ran_intercept_1 5 2748.7 2768.1 -1369.4 2738.7
## reg_bprs_Ran_intercept_2 8 2744.3 2775.4 -1364.1 2728.3 10.443 3
## Pr(>Chisq)
## reg_bprs_Ran_intercept_1
## reg_bprs_Ran_intercept_2 0.01515 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let's run an ANOVA test on models 1 and 3
anova(reg_bprs_Ran_intercept_1, reg_bprs_Ran_intercept_3)
## Data: BPRSL
## Models:
## reg_bprs_Ran_intercept_1: bprs ~ week + treatment + (1 | subject)
## reg_bprs_Ran_intercept_3: bprs ~ week + treatment + treatment * week + (treatment | subject)
## npar AIC BIC logLik deviance Chisq Df
## reg_bprs_Ran_intercept_1 5 2748.7 2768.1 -1369.4 2738.7
## reg_bprs_Ran_intercept_3 8 2581.0 2612.1 -1282.5 2565.0 173.69 3
## Pr(>Chisq)
## reg_bprs_Ran_intercept_1
## reg_bprs_Ran_intercept_3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: According to the Pr(>Chisq) values, the comparison of models 1 and 3 gives a much smaller p-value than the comparison of models 1 and 2, so model 3 improves on model 1 significantly at the 99% confidence level (the value is well below 0.01).
Moreover, model 3 has the smallest AIC, which reflects a better model.
BIC is another statistical metric we can use for assessing the models (the smaller, the better); it penalizes model complexity (number of parameters) more strongly than AIC. The BIC value of model 3 is also the smallest.
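AIC and BIC can be reproduced by hand from the log-likelihood: AIC = -2·logLik + 2k and BIC = -2·logLik + k·log(n), where k is the number of estimated parameters and n the number of observations. A quick check using the values printed for model 3 above (logLik = -1282.5, npar = 8, n = 360):

```r
logLik_m3 <- -1282.5
k <- 8    # npar from the anova() table above
n <- 360  # number of observations in BPRSL

AIC_m3 <- -2 * logLik_m3 + 2 * k
BIC_m3 <- -2 * logLik_m3 + k * log(n)

round(c(AIC = AIC_m3, BIC = BIC_m3), 1)
# matches the AIC = 2581.0 and BIC = 2612.1 reported above
```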
Now we can plot the model fitted values with original BPRS values and calculate RMSE and Bias of the values.
As before, since we have both the original BPRS values and the model fitted values, we can calculate the RMSE between them.
# Create a vector of the fitted values
Fitted <- fitted(reg_bprs_Ran_intercept_3)
RMSE = sqrt(mean((BPRSL$bprs - Fitted)^2))
The smaller the RMSE, the better a model is able to fit the data.
RMSE_percent = 100*RMSE/mean(BPRSL$bprs)
RMSE_percent
## [1] 18.34506
So, the relative RMSE is 18.35%, which confirms that our model predicts the BPRS values rather well. Good job!
We can also calculate the mean squared error (MSE), another common metric for measuring the prediction accuracy of a model.
MSE = mean((BPRSL$bprs - Fitted)^2)
MSE
## [1] 47.72661
The lower the value for MSE, the more accurately a model is able to predict values.
MSE_percent = 100*MSE/mean(BPRSL$bprs)
MSE_percent
## [1] 126.7358
So, the MSE amounts to 126.7% of the mean BPRS value; because MSE is in squared units this ratio is hard to interpret directly, but it does indicate larger prediction errors here than for the RATS model.
plot(Fitted, BPRSL$bprs, main = "BPRS fitted vs measured",
xlab = "Model fitted BPRS", ylab = "Measured BPRS",
xlim = c(0,100), ylim=c(0,100))
abline(a = 0, b = 1, col = "blue")
Interpretation: The blue line in the graph shows the 1:1 diagonal (for example, if x = 40 then y = 40), which represents the ideal case.
That line lets us easily compare the model fitted BPRS values with the original values: if the measured BPRS value is, say, 40, then ideally the fitted value should be 40 too. As the graph shows, the fitted values are close to the original values; however, the larger BPRS values are underestimated. This underestimation is also reflected in the MSE of 47.73, which is rather large, while the relative RMSE of 18.35% is rather good!
Final notes: Independence of observations is a key assumption of standard linear regression, and repeated measurements on the same subjects violate it. That is why we used mixed models here to account for the within-subject dependence in our data; otherwise, ordinary linear regression models would have sufficed.
Thanks for reading my page. Please feel free to comment, ask questions, or contact me. :)
The END